In this report we work with a data set provided by Bellabeat, a high-tech firm that specializes in health-tracking smart devices for women.
The data can be downloaded from https://www.kaggle.com/datasets/arashnic/fitbit.
The primary objective is to find trends, patterns or other indications that might lead to useful insights for Bellabeat's marketing analytics team.
In the first step, we define the context, background, key players, problem and objectives of our case.
Since 2013, Bellabeat has been a high-tech manufacturer of beautifully designed, health-focused smart products for women. By inspiring and empowering women with knowledge about their own health and habits, Bellabeat has grown rapidly and positioned itself as a tech-driven wellness company for women.
The co-founder and Chief Creative Officer, Urška Sršen, is confident that an analysis of non-Bellabeat consumer data (i.e., FitBit fitness tracker usage data) would reveal more opportunities for growth.
Analyze FitBit fitness tracker data to gain insights into how consumers are using the FitBit app and discover trends for Bellabeat marketing strategy.
Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.
In the second step, we clarify the data sources and limitations of the data set.
We will use the FitBit Fitness Tracker Data. The data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016. This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. These thirty eligible users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore user habits.
According to Google, reliable data sources are ROCCC which stands for Reliable, Original, Comprehensive, Current and Cited. For this data set;
Overall, the data set is considered low quality data and it is not recommended to produce business recommendations based on this data.
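Among the 18 data sets provided in this case study, only 6 files are relevant to our research. These are dailyActivity_merged.csv, dailySteps_merged.csv, dailyIntensities_merged.csv, dailyCalories_merged.csv, sleepDay_merged.csv and weightLogInfo_merged.csv.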
Out of these 6 data sets, dailyActivity_merged.csv already includes the crucial activity and calorie data from dailySteps_merged.csv, dailyIntensities_merged.csv and dailyCalories_merged.csv, so we disregard those three files in the rest of the analysis.
Of the 3 remaining files, we check weightLogInfo_merged.csv by applying the n_distinct() function to the Id column. It contains only 8 unique participants, which is too small a sample for further analysis, so we will not analyze weightLogInfo_merged.csv any further. Applying the same method to the remaining 2 data sets, we find 33 unique participants in dailyActivity_merged.csv and 24 unique participants in sleepDay_merged.csv, so we conclude that these data sets are eligible for further analysis.
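The participant check described above can be sketched in base R as follows. This is only an illustration on toy data frames: the Id values, the object names and the min_participants threshold are assumptions made for the example, not values taken from the case study files.

```r
# Toy stand-ins for the three files; the Id values are made up for illustration.
daily_activity <- data.frame(Id = c(1, 1, 2, 3, 3, 4))
sleep_day      <- data.frame(Id = c(1, 2, 2, 3))
weight_log     <- data.frame(Id = c(1, 1))

tables <- list(
  dailyActivity = daily_activity,
  sleepDay      = sleep_day,
  weightLogInfo = weight_log
)

# Count unique participants per table (base R equivalent of dplyr::n_distinct()).
participants <- vapply(tables, function(df) length(unique(df$Id)), integer(1))
print(participants)

# Keep only tables that reach a minimum participant count.
# The threshold below is a hypothetical choice for this toy example.
min_participants <- 3L
eligible <- names(participants)[participants >= min_participants]
print(eligible)
```

Looping over a named list keeps the check identical for every file, which makes it harder to apply a different rule to one data set by accident.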
During the process phase we will import & explore the data and start the data cleaning process. We will check for missing or null values, reformat data types and perform preliminary statistical analysis.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(dplyr)
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
data <- read_csv("/Users/yigitkasapoglu/Desktop/Data/CaseStudy1_R/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_data <- read_csv("/Users/yigitkasapoglu/Desktop/Data/CaseStudy1_R/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
sum(is.na(data))
## [1] 0
sum(is.na(sleep_data))
## [1] 0
No missing values found.
sum(is.null(data))
## [1] 0
sum(is.null(sleep_data))
## [1] 0
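Neither object is NULL. Note that is.null() tests the object as a whole rather than its individual cells, so the is.na() check above is the one that covers missing values.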
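To make the difference between the two checks concrete, here is a small sketch on a toy data frame with deliberately missing cells. The values are invented; only the column names echo the activity data.

```r
# Toy data with deliberately missing cells; values are illustrative only.
toy <- data.frame(
  Id         = c(1, 2, 3),
  TotalSteps = c(1000, NA, 3000),
  Calories   = c(NA, NA, 2100)
)

# is.null() asks whether the whole object is NULL, so it is FALSE here
# even though several cells are missing.
object_is_null <- is.null(toy)

# is.na() works cell by cell; colSums() then gives a count per column.
na_per_column <- colSums(is.na(toy))

print(object_is_null)
print(na_per_column)
```

A per-column count is often more useful than a single total, because it shows which variable would need cleaning.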
str(data)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(sleep_data)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
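We can see that the ActivityDate column in the first data set and the SleepDay column in the second are stored as character strings rather than dates, so we convert both.
data$ActivityDate <- as.Date(data$ActivityDate, format = "%m/%d/%Y")
# SleepDay values look like "4/12/2016 12:00:00 AM"; as.Date() ignores the trailing time text
sleep_data$SleepDay <- as.Date(sleep_data$SleepDay, format = "%m/%d/%Y")
Both columns are now stored as dates and ready to analyze. Parsing SleepDay with the format "%Y-%m-%d" instead returns NA for every row, which is what the SleepDay column of the summary(sleep_data) output further below shows.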
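Because as.Date() returns NA instead of raising an error when the format string does not match, it is worth checking the result right after the conversion. The sketch below uses sample strings copied from the str() output above; the variable names are hypothetical.

```r
# Sample strings copied from the str() output above.
activity_raw <- c("4/12/2016", "4/13/2016")
sleep_raw    <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Month/day/year format; as.Date() ignores the trailing time text.
activity_date <- as.Date(activity_raw, format = "%m/%d/%Y")
sleep_date    <- as.Date(sleep_raw, format = "%m/%d/%Y")

# A format that does not match the strings yields NA rather than an error.
mismatched <- as.Date(sleep_raw, format = "%Y-%m-%d")

# Guard: stop if the conversion introduced any NA.
stopifnot(!anyNA(activity_date), !anyNA(sleep_date))

print(sleep_date)
print(mismatched)
```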
n_distinct(data$Id)
## [1] 33
n_distinct(sleep_data$Id)
## [1] 24
We observe 33 participants in the first data set instead of the 30 claimed. The second data set has 24 participants, so at least 9 of the 33 participants in the activity data have no sleep records.
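Comparing the two counts only tells us how many participants differ, not which ones. A minimal sketch of how the Ids themselves could be compared is shown below; the Id vectors are invented for illustration and stand in for data$Id and sleep_data$Id.

```r
# Illustrative Id vectors, not the real participant Ids.
activity_ids <- c(101, 102, 103, 104, 105)
sleep_ids    <- c(101, 103, 103, 105)

# Participants with activity data but no sleep records.
no_sleep <- setdiff(activity_ids, sleep_ids)

# Sleep Ids that never appear in the activity table (ideally none).
sleep_only <- setdiff(sleep_ids, activity_ids)

print(no_sleep)
print(length(sleep_only))
```

If sleep_only is empty, every sleep participant also appears in the activity data, and the number of participants without sleep records is simply the difference between the two counts.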
data$WeekDays <- wday(data$ActivityDate, label = TRUE)
Now we can deepen our analysis by checking which weekdays have more logs and activities.
data %>% relocate("WeekDays", .after = "ActivityDate")
## # A tibble: 940 × 16
## Id Activity…¹ WeekD…² Total…³ Total…⁴ Track…⁵ Logge…⁶ VeryA…⁷ Moder…⁸
## <dbl> <date> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 Tue 13162 8.5 8.5 0 1.88 0.550
## 2 1503960366 2016-04-13 Wed 10735 6.97 6.97 0 1.57 0.690
## 3 1503960366 2016-04-14 Thu 10460 6.74 6.74 0 2.44 0.400
## 4 1503960366 2016-04-15 Fri 9762 6.28 6.28 0 2.14 1.26
## 5 1503960366 2016-04-16 Sat 12669 8.16 8.16 0 2.71 0.410
## 6 1503960366 2016-04-17 Sun 9705 6.48 6.48 0 3.19 0.780
## 7 1503960366 2016-04-18 Mon 13019 8.59 8.59 0 3.25 0.640
## 8 1503960366 2016-04-19 Tue 15506 9.88 9.88 0 3.53 1.32
## 9 1503960366 2016-04-20 Wed 10544 6.68 6.68 0 1.96 0.480
## 10 1503960366 2016-04-21 Thu 9819 6.34 6.34 0 1.34 0.350
## # … with 930 more rows, 7 more variables: LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>, and abbreviated variable names
## # ¹ActivityDate, ²WeekDays, ³TotalSteps, ⁴TotalDistance, ⁵TrackerDistance,
## # ⁶LoggedActivitiesDistance, ⁷VeryActiveDistance, ⁸ModeratelyActiveDistance
data$TotalMinutes <- data$SedentaryMinutes + data$LightlyActiveMinutes + data$FairlyActiveMinutes + data$VeryActiveMinutes
Now, we can observe the total count of logged activity minutes.
data$TotalHours <- data$TotalMinutes/60
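The same derivation can be written with rowSums(), which also makes it easy to flag days on which the four intensity categories do not add up to a full 24 hours. The sketch below uses invented values; only the column names follow dailyActivity_merged.csv, and the FullDay flag is a hypothetical addition for illustration.

```r
# Toy rows; column names follow dailyActivity_merged.csv, values are invented.
toy <- data.frame(
  SedentaryMinutes     = c(1000, 700, 1440),
  LightlyActiveMinutes = c(300, 200, 0),
  FairlyActiveMinutes  = c(80, 20, 0),
  VeryActiveMinutes    = c(60, 10, 0)
)

minute_cols <- c("SedentaryMinutes", "LightlyActiveMinutes",
                 "FairlyActiveMinutes", "VeryActiveMinutes")

# Total tracked minutes and hours per day.
toy$TotalMinutes <- rowSums(toy[, minute_cols])
toy$TotalHours   <- toy$TotalMinutes / 60

# A full day has 1440 minutes; fewer suggests the tracker was not worn all day.
toy$FullDay <- toy$TotalMinutes == 1440

print(toy[, c("TotalMinutes", "TotalHours", "FullDay")])
```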
We will start our analysis by checking the summarized statistics of the data set and adding charts to see if there are any trends or patterns to identify.
summary(data)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Median :2016-04-26 Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean :2016-04-26 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :2016-05-12 Max. :36019 Max. :28.030
##
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
##
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
##
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
##
## Calories WeekDays TotalMinutes TotalHours
## Min. : 0 Sun:121 Min. : 2.0 Min. : 0.03333
## 1st Qu.:1828 Mon:120 1st Qu.: 989.8 1st Qu.:16.49583
## Median :2134 Tue:152 Median :1440.0 Median :24.00000
## Mean :2304 Wed:150 Mean :1218.8 Mean :20.31255
## 3rd Qu.:2793 Thu:147 3rd Qu.:1440.0 3rd Qu.:24.00000
## Max. :4900 Fri:126 Max. :1440.0 Max. :24.00000
## Sat:124
summary(sleep_data)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## Min. :1.504e+09 Min. :NA Min. :1.000 Min. : 58.0
## 1st Qu.:3.977e+09 1st Qu.:NA 1st Qu.:1.000 1st Qu.:361.0
## Median :4.703e+09 Median :NA Median :1.000 Median :433.0
## Mean :5.001e+09 Mean :NaN Mean :1.119 Mean :419.5
## 3rd Qu.:6.962e+09 3rd Qu.:NA 3rd Qu.:1.000 3rd Qu.:490.0
## Max. :8.792e+09 Max. :NA Max. :3.000 Max. :796.0
## NA's :413
## TotalTimeInBed
## Min. : 61.0
## 1st Qu.:403.0
## Median :463.0
## Mean :458.6
## 3rd Qu.:526.0
## Max. :961.0
##
Findings based on summarized statistics
# WeekDays is a discrete variable, so geom_bar() counts the records per day directly
ggplot(data = data, aes(x = WeekDays)) +
  geom_bar(fill = "slateblue2", color = "slateblue2") +
  labs(x = "Day", y = "Login Frequency", title = "Weekly Frequency of User Logins") +
  theme_classic() +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_line(colour = "grey"))
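We observe that Tuesdays, Wednesdays and Thursdays have the most activity records (152, 150 and 147, against 120 to 126 for the other days). Part of this gap comes from the calendar: the tracking window from 12 April to 12 May 2016 contains five Tuesdays, Wednesdays and Thursdays but only four of every other day. The raw counts therefore give only weak support for the conclusion that participants use the app mostly during weekdays.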
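One way to check how much of the weekday difference is due to the calendar is to divide each count by the number of times that weekday occurs in the tracking window. The sketch below takes the window from the ActivityDate range in summary(data) and the record counts from its WeekDays column; the variable names are hypothetical.

```r
# Tracking window taken from the summary() output (2016-04-12 to 2016-05-12).
days <- seq(as.Date("2016-04-12"), as.Date("2016-05-12"), by = "day")

# as.POSIXlt()$wday is locale-independent: 0 = Sunday ... 6 = Saturday.
day_index <- as.POSIXlt(days)$wday
labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
occurrences <- table(factor(labels[day_index + 1], levels = labels))

# Record counts per weekday, copied from the WeekDays column of summary(data).
records <- c(Sun = 121, Mon = 120, Tue = 152, Wed = 150,
             Thu = 147, Fri = 126, Sat = 124)

# Average number of records per calendar occurrence of each weekday.
records_per_day <- records / as.vector(occurrences)

print(occurrences)
print(round(records_per_day, 1))
```

With these inputs the per-occurrence counts fall between roughly 29 and 32 for every weekday, so the normalized view is much flatter than the raw bar chart.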
ggplot(data, aes(x=WeekDays, y=TotalSteps, fill=WeekDays)) +
geom_bar(stat="identity", width=0.5) +
labs(title="Total Steps by Weekday", x="Weekday", y="Total Steps")
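The data suggest that users logged the most steps on Tuesdays, Wednesdays, Fridays and Saturdays. Because the chart sums steps over all records, weekdays with more records accumulate higher totals, so these totals should be read together with the number of records per day.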
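A sum per weekday mixes two effects: how much people walk and how many records exist for that day. The sketch below contrasts the sum with the mean on a toy data frame; the values are invented so that both days share the same total.

```r
# Toy records; values are invented to show the effect of unequal day counts.
toy <- data.frame(
  WeekDays   = c("Tue", "Tue", "Tue", "Sat", "Sat"),
  TotalSteps = c(6000, 7000, 8000, 10000, 11000)
)

# Total steps per weekday (what a stacked stat = "identity" bar shows).
total_steps <- tapply(toy$TotalSteps, toy$WeekDays, sum)

# Mean steps per record, which does not depend on the number of records.
mean_steps <- tapply(toy$TotalSteps, toy$WeekDays, mean)

print(total_steps)
print(mean_steps)
```

In this toy case both days have the same total, yet the Saturday mean is clearly higher, which is the kind of difference a summed chart can hide.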
To summarize the results of our analysis and communicate possible solutions, we have several key points that will be addressed below.
Recommended Action: The company can motivate users to increase their step counts by offering incentives and rewards to users who reach the 10,000-step milestone every day. A streak-counting tool can be added to promote consistency and create more reward opportunities. This will also motivate users to log in to the app every day to claim rewards, thus increasing the time users spend in the app. Another feature for collaboration with friends can be added, so users can invite friends and family to join them and reach 10,000 steps per day together. This addition might help users achieve their goals while helping the company increase its number of users.
Recommended Action: Bellabeat's tracker sensors gather activity data from users constantly. The company can address this problem by adding a feature to the app that sends a notification once the user passes a certain threshold of sedentary minutes per day. Users can be motivated to stay active throughout the day with daily activity rewards and incentives.
Recommended Action: The company can provide nutritional assistance by offering on-demand nutritionist services. Moreover, a nutrition education page can be added to the app so users can better understand their diets and how to manage daily calorie intake based on personal goals.
Recommended Action: The Bellabeat app can be used to track users' sleeping patterns and offer in-depth insights into their sleeping habits, accompanied by advice on how to improve their sleep. This feature can be monetized by offering personalized sleep assistance and additional tracking features through the app.
In this brief report, we have gathered, processed, cleaned and analyzed the Fitbit data to better understand user activity patterns and trends. We found several key patterns and trends that point to possible growth opportunities and improvements in user activity. Finally, we translated these possible improvements into actionable recommendations for the Bellabeat app and marketing team. We have accomplished all business objectives and deliverables required by the company and offered further monetization opportunities.