In this report, we work with a data set provided by Bellabeat, a high-tech firm that specializes in health-tracking smart devices for women.

The data can be downloaded from Kaggle: https://www.kaggle.com/datasets/arashnic/fitbit

The primary objective is to find trends, patterns, or other indications that could lead to useful insights for Bellabeat's marketing analytics team.

Phase 1 - Ask

In the first step, we define the context, background, key players, problem and objectives of our case.

1.1 Background

Bellabeat has been a high-tech manufacturer of beautifully designed, health-focused smart products for women since 2013. By inspiring and empowering women with knowledge about their own health and habits, Bellabeat has grown rapidly and positioned itself as a tech-driven wellness company for women.

The co-founder and Chief Creative Officer, Urška Sršen, is confident that an analysis of non-Bellabeat consumer data (i.e., FitBit fitness tracker usage data) would reveal more opportunities for growth.

1.2 Business Task

Analyze FitBit fitness tracker data to gain insights into how consumers use the FitBit app and to discover trends that can inform Bellabeat's marketing strategy.

1.3 Business Objectives

  • What are the trends identified?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat marketing strategy?

1.4 Deliverables

  • A clear summary of the business task
  • A description of all data sources used
  • Documentation of any cleaning or manipulation of data
  • A summary of analysis
  • Supporting visualizations and key findings
  • High-level content recommendations based on the analysis

1.5 Key Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

Phase 2 - Prepare

In the second step, we clarify the data sources and limitations of the data set.

2.1 Data Sources

I will use the FitBit Fitness Tracker Data. The data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The data set includes information about daily activity, steps, and heart rate that can be used to explore user habits.

2.2 Limitations of The Data

  • The data set covers only 30 users, which is a small sample size.
  • The data was generated in 2016, 7 years before this analysis. User habits, activity levels, and calorie intake may have changed since then, so the data may no longer be representative.
  • The data was collected via a survey on Amazon Mechanical Turk. Since it is a third-party data set, we cannot assess the integrity, clarity, accuracy, and transparency of the data.

2.3 Is Data ROCCC?

According to Google, good data sources are ROCCC, which stands for Reliable, Original, Comprehensive, Current, and Cited. For this data set:

  • Reliable - LOW - Not reliable as it only has 30 participants
  • Original - LOW - Third-party provider (Amazon Mechanical Turk)
  • Comprehensive - MEDIUM - Parameters match most of Bellabeat products’ parameters
  • Current - LOW - Data is 7 years old and may not be relevant
  • Cited - LOW - Data collected from a third party, hence unknown

Overall, the data set is considered low quality, and basing business recommendations on this data alone is not recommended.

2.4 Data Filtering & Selection

Of the 18 data sets provided in this case study, only 6 files are relevant to our research:

  • dailyActivity_merged.csv
  • sleepDay_merged.csv
  • dailyCalories_merged.csv
  • dailySteps_merged.csv
  • weightLogInfo_merged.csv
  • dailyIntensities_merged.csv

Of these 6 data sets, dailyActivity_merged.csv already includes the key activity and calorie data from dailySteps_merged.csv, dailyIntensities_merged.csv, and dailyCalories_merged.csv, so we disregard those three files in the rest of the analysis.

For the 3 remaining files, we count participants by applying the n_distinct() function to the Id column. weightLogInfo_merged.csv contains only 8 unique participants, which is too small a sample for further analysis, so we do not analyze it any further. Following the same method, we find 33 unique participants in dailyActivity_merged.csv and 24 unique participants in sleepDay_merged.csv, so we conclude that these two data sets are eligible for further analysis.
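The participant count described above can be sketched as follows. This is a minimal illustration on made-up stand-in tables (the object names and all values are assumptions, not the real files), showing n_distinct() applied to the Id column of several tables in one pass.

```r
library(dplyr)

# Toy stand-ins for the real files; all Id values and measurements are made up.
weight_log <- data.frame(Id = c(1, 1, 2), WeightKg = c(60.1, 60.3, 72.0))
daily_activity <- data.frame(Id = c(1, 1, 2, 3, 4),
                             TotalSteps = c(9000, 8000, 4000, 12000, 7000))
sleep_day <- data.frame(Id = c(1, 2, 2, 3),
                        TotalMinutesAsleep = c(400, 380, 420, 450))

# Count distinct participants in each table.
participants <- vapply(
  list(weight = weight_log, activity = daily_activity, sleep = sleep_day),
  function(df) n_distinct(df$Id),
  integer(1)
)
print(participants)
```

Counting per table before choosing which files to keep makes the sample-size argument explicit and easy to repeat if more files are added.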

Phase 3 - Process

During the process phase we will import & explore the data and start the data cleaning process. We will check for missing or null values, reformat data types and perform preliminary statistical analysis.

3.1 Load data & import required packages.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readr)
library(dplyr)
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Preview and explore data

data <- read_csv("/Users/yigitkasapoglu/Desktop/Data/CaseStudy1_R/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_data <- read_csv("/Users/yigitkasapoglu/Desktop/Data/CaseStudy1_R/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹​ActivityDate, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​VeryActiveDistance,
## #   ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance, ⁹​SedentaryActiveDistance

3.2 Check for missing values

sum(is.na(data))
## [1] 0
sum(is.na(sleep_data))
## [1] 0

No missing values found.

3.3 Check for null values

sum(is.null(data))
## [1] 0
sum(is.null(sleep_data))
## [1] 0

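Neither object is NULL. Note that is.null() tests the object itself rather than its entries, so it returns FALSE for any data frame; missing entries inside a data frame are stored as NA, which the check in 3.2 covers.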
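As a hedged sketch of the difference between the two checks above, the toy data frame below (made-up values) contains one missing entry. is.null() does not detect it, while is.na() does, and colSums() shows which column it sits in.

```r
# Toy data frame with one missing value; the values are made up.
toy <- data.frame(Id = c(1, 2, 3), TotalSteps = c(1000, NA, 3000))

# is.null() asks whether the object itself is NULL, so this is FALSE
# even though the table contains a missing entry.
object_is_null <- is.null(toy)

# is.na() works element-wise; colSums() counts the gaps per column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

A per-column count is usually more useful than a single total because it points directly at the column that needs cleaning.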

3.4 Check for structural errors and incorrect data types

str(data)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(sleep_data)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

We can see that the ActivityDate column in the first data set and the SleepDay column in the second data set are stored as character strings rather than dates.

Reformat incorrect date data

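data$ActivityDate <- as.Date(data$ActivityDate, format = "%m/%d/%Y")
# SleepDay values look like "4/12/2016 12:00:00 AM"; the trailing time is ignored
sleep_data$SleepDay <- as.Date(sleep_data$SleepDay, format = "%m/%d/%Y")

Both columns are converted from character to Date and are ready to analyze. The format string must match the input: parsing SleepDay with "%Y-%m-%d" would return NA for every row.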
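To illustrate how the format string affects date parsing, here is a small sketch using sample strings in the two layouts shown by str() above. The variable names are assumptions for illustration only.

```r
# Sample strings in the two layouts seen in the str() output.
activity_raw <- c("4/12/2016", "4/13/2016")
sleep_raw <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

activity_date <- as.Date(activity_raw, format = "%m/%d/%Y")

# Once the date part has been matched, trailing characters are ignored.
sleep_date <- as.Date(sleep_raw, format = "%m/%d/%Y")

# A format that does not match the input yields NA instead of an error.
mismatched <- as.Date(sleep_raw, format = "%Y-%m-%d")

# Guard against silent parse failures before continuing.
stopifnot(!anyNA(activity_date), !anyNA(sleep_date))
```

Because a mismatched format fails silently, a check such as anyNA() right after the conversion is a cheap way to catch the problem early.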

3.5 Count unique participant IDs

n_distinct(data$Id)
## [1] 33
n_distinct(sleep_data$Id)
## [1] 24

We observe 33 participants in the first data set instead of the 30 claimed. The second data set has only 24 participants, which means that we have no sleep data for at least 9 of the 33 participants.
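Two related checks can be sketched on toy data (all values below are made up, and the object names are assumptions): which participants appear in the activity table but not in the sleep table, and whether a table contains fully duplicated rows.

```r
# Toy tables; the Id values and measurements are made up.
activity <- data.frame(Id = c(1, 1, 2, 3),
                       TotalSteps = c(9000, 8000, 4000, 12000))
sleep <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(400, 400, 380)
)

# Participants present in the activity table but absent from the sleep table.
no_sleep_ids <- setdiff(unique(activity$Id), unique(sleep$Id))

# Rows that repeat an earlier row in every column.
n_duplicate_rows <- sum(duplicated(sleep))

# Keep one copy of each row.
sleep_clean <- unique(sleep)
```

Listing the missing IDs, rather than only comparing counts, confirms that the smaller table really is a subset of the larger one.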

3.6 Add new column “WeekDays” for further analysis

data$WeekDays <- wday(data$ActivityDate, label = TRUE)

Now we can deepen our analysis by checking which weekdays have more logs and activities.

Relocate the WeekDays column for ease of use. The result is only printed here; assign it back to data to keep the new column order.

data %>% relocate("WeekDays", .after = "ActivityDate")
## # A tibble: 940 × 16
##            Id Activity…¹ WeekD…² Total…³ Total…⁴ Track…⁵ Logge…⁶ VeryA…⁷ Moder…⁸
##         <dbl> <date>     <ord>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1503960366 2016-04-12 Tue       13162    8.5     8.5        0    1.88   0.550
##  2 1503960366 2016-04-13 Wed       10735    6.97    6.97       0    1.57   0.690
##  3 1503960366 2016-04-14 Thu       10460    6.74    6.74       0    2.44   0.400
##  4 1503960366 2016-04-15 Fri        9762    6.28    6.28       0    2.14   1.26 
##  5 1503960366 2016-04-16 Sat       12669    8.16    8.16       0    2.71   0.410
##  6 1503960366 2016-04-17 Sun        9705    6.48    6.48       0    3.19   0.780
##  7 1503960366 2016-04-18 Mon       13019    8.59    8.59       0    3.25   0.640
##  8 1503960366 2016-04-19 Tue       15506    9.88    9.88       0    3.53   1.32 
##  9 1503960366 2016-04-20 Wed       10544    6.68    6.68       0    1.96   0.480
## 10 1503960366 2016-04-21 Thu        9819    6.34    6.34       0    1.34   0.350
## # … with 930 more rows, 7 more variables: LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>, and abbreviated variable names
## #   ¹​ActivityDate, ²​WeekDays, ³​TotalSteps, ⁴​TotalDistance, ⁵​TrackerDistance,
## #   ⁶​LoggedActivitiesDistance, ⁷​VeryActiveDistance, ⁸​ModeratelyActiveDistance

3.7 Create new column “TotalMinutes”

data$TotalMinutes <- data$SedentaryMinutes + data$LightlyActiveMinutes + data$FairlyActiveMinutes + data$VeryActiveMinutes

Now we can observe the total number of minutes tracked per day across all four intensity levels, including sedentary time.

Create new column “TotalHours” to see activity data in hours

data$TotalHours <- data$TotalMinutes/60
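The two derived columns can also be created in a single mutate() call, together with a sanity check. The sketch below uses the first and third rows shown in the str() output plus one made-up row; the FullDay column is a hypothetical helper that is not part of the original analysis.

```r
library(dplyr)

# Rows 1 and 2 come from the str() output above; row 3 is made up.
toy <- data.frame(
  SedentaryMinutes = c(728, 1218, 2),
  LightlyActiveMinutes = c(328, 181, 0),
  FairlyActiveMinutes = c(13, 11, 0),
  VeryActiveMinutes = c(25, 30, 0)
)

toy <- toy %>%
  mutate(
    TotalMinutes = SedentaryMinutes + LightlyActiveMinutes +
      FairlyActiveMinutes + VeryActiveMinutes,
    TotalHours = TotalMinutes / 60,
    # Hypothetical helper: TRUE when the tracker covered the whole day.
    FullDay = TotalMinutes == 1440
  )

# A day has 1440 minutes, so a larger total would signal a data problem.
stopifnot(all(toy$TotalMinutes <= 1440))
```

Flagging full days separately makes it possible to tell days with complete tracking from days where the device was worn only part of the time.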

Phase 4 - Analyze

We will start our analysis by checking the summarized statistics of the data set and adding charts to see if there are any trends or patterns to identify.

summary(data)
##        Id             ActivityDate          TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Median :2016-04-26   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019   Max.   :28.030  
##                                                                           
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##                                                              
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##                                                                      
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
##                                                                             
##     Calories    WeekDays   TotalMinutes      TotalHours      
##  Min.   :   0   Sun:121   Min.   :   2.0   Min.   : 0.03333  
##  1st Qu.:1828   Mon:120   1st Qu.: 989.8   1st Qu.:16.49583  
##  Median :2134   Tue:152   Median :1440.0   Median :24.00000  
##  Mean   :2304   Wed:150   Mean   :1218.8   Mean   :20.31255  
##  3rd Qu.:2793   Thu:147   3rd Qu.:1440.0   3rd Qu.:24.00000  
##  Max.   :4900   Fri:126   Max.   :1440.0   Max.   :24.00000  
##                 Sat:124
summary(sleep_data)
##        Id               SleepDay   TotalSleepRecords TotalMinutesAsleep
##  Min.   :1.504e+09   Min.   :NA    Min.   :1.000     Min.   : 58.0     
##  1st Qu.:3.977e+09   1st Qu.:NA    1st Qu.:1.000     1st Qu.:361.0     
##  Median :4.703e+09   Median :NA    Median :1.000     Median :433.0     
##  Mean   :5.001e+09   Mean   :NaN   Mean   :1.119     Mean   :419.5     
##  3rd Qu.:6.962e+09   3rd Qu.:NA    3rd Qu.:1.000     3rd Qu.:490.0     
##  Max.   :8.792e+09   Max.   :NA    Max.   :3.000     Max.   :796.0     
##                      NA's   :413                                       
##  TotalTimeInBed 
##  Min.   : 61.0  
##  1st Qu.:403.0  
##  Median :463.0  
##  Mean   :458.6  
##  3rd Qu.:526.0  
##  Max.   :961.0  
## 

Findings based on summarized statistics

4.1 Plot a bar chart to analyze user login frequency

# WeekDays is categorical, so geom_bar() counts the records per day
ggplot(data = data, aes(x = WeekDays)) +
    geom_bar(fill = "slateblue2", color = "slateblue2") +
    labs(x = "Day", y = "Login Frequency", title = "Weekly Frequency of User Logins") +
    theme_classic() +
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_line(colour = "grey"))

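We observe that Tuesdays, Wednesdays and Thursdays have the most activity records. Part of this difference comes from the collection window: it runs from Tuesday 2016-04-12 to Thursday 2016-05-12, so these three weekdays occur five times and the other weekdays four times. The raw counts alone therefore do not show that participants use the app more on those days.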
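One way to compare weekdays fairly is to divide the record counts by the number of times each weekday occurs in the collection window. The sketch below takes the counts from the WeekDays column of the summary above and the window from the ActivityDate minimum and maximum; the variable names are assumptions.

```r
# Records per weekday, copied from the WeekDays column of summary(data).
record_counts <- c(Sun = 121, Mon = 120, Tue = 152, Wed = 150,
                   Thu = 147, Fri = 126, Sat = 124)

# Every calendar day between the first and last ActivityDate.
window <- seq(as.Date("2016-04-12"), as.Date("2016-05-12"), by = "day")

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, independent of locale.
occurrences <- tabulate(as.POSIXlt(window)$wday + 1, nbins = 7)
names(occurrences) <- names(record_counts)

# Average number of records per calendar occurrence of each weekday.
records_per_occurrence <- round(record_counts / occurrences, 1)
print(records_per_occurrence)
```

After this normalization, every weekday has roughly 29 to 32 records per calendar day, so the differences between weekdays become small.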

4.2 Plot a bar chart to compare TotalSteps by WeekDays

ggplot(data, aes(x=WeekDays, y=TotalSteps, fill=WeekDays)) +
    geom_bar(stat="identity", width=0.5) +
    labs(title="Total Steps by Weekday", x="Weekday", y="Total Steps")

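The chart suggests that users were most active on Tuesdays, Wednesdays, Fridays and Saturdays. Because each bar sums the steps of all records for that weekday, weekdays with more records produce taller bars.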
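To separate how many records a weekday has from how active users were on that day, the sum and the mean can be computed side by side. This is a sketch on made-up values, not the real data.

```r
library(dplyr)

# Toy records: three Tuesdays and two Saturdays, with made-up step counts.
toy <- data.frame(
  WeekDays = c("Tue", "Tue", "Tue", "Sat", "Sat"),
  TotalSteps = c(8000, 9000, 10000, 12000, 14000)
)

steps_by_day <- toy %>%
  group_by(WeekDays) %>%
  summarise(
    Records = n(),
    SumSteps = sum(TotalSteps),
    MeanSteps = mean(TotalSteps)
  )
print(steps_by_day)
```

In this toy example Tuesday has the larger sum only because it has more records, while Saturday has the higher mean. Plotting MeanSteps instead of the raw sum would avoid that effect.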

Phase 5 - Share

In this section we will generate several charts to share our findings and insights.

ggplot(data, aes(x = ActivityDate, y = TotalSteps, group = 1)) +
    # fun = "mean" plots the mean across participants for each date
    geom_line(stat = "summary", fun = "mean", color = "blue") +
    labs(title = "Average Steps per Day", x = "Date", y = "Average Steps")

The overall average is 7,638 steps per day (see the summary above), which is below the amount recommended by the CDC (https://www.healthline.com/health/average-steps-per-day#guidelines).

ggplot(data, aes(x = ActivityDate, y = SedentaryMinutes, group = 1)) +
    # fun = "mean" plots the mean across participants for each date
    geom_line(stat = "summary", fun = "mean", color = "red") +
    labs(title = "Average Sedentary Minutes per Day", x = "Date", y = "Sedentary Minutes")

Another important finding is that the average sedentary time is 991 minutes (about 16.5 hours) per day. This indicates that users spend most of their time in non-active states, which may increase health risks in the long term (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996155/).
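The conversion behind the 16.5-hour figure, and the share of tracked time spent in each intensity level, can be worked out from the means reported in the summary above. The variable names are assumptions for illustration.

```r
# Mean minutes per day, copied from summary(data).
mean_minutes <- c(Sedentary = 991.2, LightlyActive = 192.8,
                  FairlyActive = 13.56, VeryActive = 21.16)

# Convert minutes to hours.
mean_hours <- mean_minutes / 60

# Share of the average tracked time spent at each intensity level.
share_of_tracked <- mean_minutes / sum(mean_minutes)

print(round(mean_hours, 2))
print(round(share_of_tracked, 3))
```

On these figures, sedentary time accounts for a little over 80% of the average tracked time per day.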

ggplot(data, aes(x = ActivityDate, y = Calories, group = 1)) +
    # fun = "mean" plots the mean across participants for each date
    geom_line(stat = "summary", fun = "mean", color = "green") +
    labs(title = "Average Calories Burnt per Day", x = "Date", y = "Calories Burnt")

Users burnt 2,304 calories per day on average (see the summary above). According to Healthline, most adult females need 1,600 to 2,200 calories per day, while adult males need 2,200 to 3,000 calories per day. Based on these figures, we can conclude that the users burnt enough calories to maintain their weight most of the time. Daily calorie goals can be adjusted based on personal goals (https://www.healthline.com/health/fitness-exercise/how-many-calories-do-i-burn-a-day#daily-calorie-burn-by-intent).

Phase 6 - Act

To summarize the results of our analysis and communicate possible solutions, we have several key points that will be addressed below.

Recommended Action: The company can motivate users to increase their step counts by offering incentives and rewards to users who reach a 10,000-step milestone every day. A streak-counting tool can be added to promote consistency and create more reward opportunities. This will also motivate users to log in to the app every day to claim rewards, thus increasing the time users spend in the app. A feature for collaborating with friends can also be added, so users can invite friends and family to join them and reach 10,000 steps per day together. This addition might help users achieve their goals while helping the company grow its user base.

Recommended Action: Bellabeat's tracker sensors gather activity data from users constantly. The company can address the high sedentary time by adding an app feature that sends a notification once the user passes a certain threshold of sedentary minutes per day. Users can be motivated to stay active throughout the day by daily activity rewards and incentives.

Recommended Action: The company can provide nutritional assistance by offering on-demand nutritionist services. Moreover, a nutrition education page can be added to the app so users can better understand their diets and how to manage daily calorie intake based on personal goals.

Recommended Action: The Bellabeat app can be used to track users' sleeping patterns and offer in-depth insights into their sleeping habits, accompanied by advice on how to improve their sleep. This feature can be monetized by offering personalized sleeping assistance and more tracking features through the app.

Conclusion

In this brief report, we have gathered, processed, cleaned and analyzed the Fitbit data to better understand user activity patterns and trends. We found several key patterns and trends that point to possible growth opportunities and improvements in user activity. Finally, we translated these findings into actionable recommendations for the Bellabeat app and marketing team. We have accomplished all business objectives and deliverables required by the company and offered further monetization opportunities.

References

  • MD, Kaggle
  • Katie Huang, Kaggle