Bellabeat, founded by Urška Sršen and Sando Mur, has made health-focused smart products since 2013 that give women data on their activity, sleep, stress, and reproductive health. Having expanded globally by 2016, the company emphasizes digital marketing through Google, Facebook, Instagram, Twitter, and YouTube. Sršen wants to use smart device usage data to guide marketing strategy.
This capstone project is part of the Google Data Analytics Professional Certificate course. In it, we explore a data set for Bellabeat, a high-tech manufacturer of health-focused products for women.
As a junior analyst on the marketing analyst team at Bellabeat, I will focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers use their smart devices, and then provide high-level recommendations for Bellabeat’s marketing strategy.
In this phase, I will analyze trends in smart device usage data to gain insights into how consumers utilize non-Bellabeat smart devices. This analysis will inform the development of Bellabeat’s marketing strategy.
Dataset: As requested by the company’s stakeholders, the Fitbit Fitness Tracker Data set, a third-party open data set hosted on Kaggle, is used. It contains personal tracker data from 30 users who agreed to share it, including minute-level output for physical activity, heart rate, and sleep monitoring. It also includes daily activity, steps, and heart rate information that can be used to explore users’ habits.
Storage: The data was downloaded from Kaggle and stored on a local drive on macOS.
Data Organization: The data is organised in long format, that is, each row holds one observation for one user at one point in time, so the tables have far more rows than columns.
Bias/Credibility Issues: The ROCCC criteria are used to assess the bias and credibility of the data set.
R (Reliable): The reliability of this data is questionable due to the absence of information regarding the margin of error. Additionally, the small sample size of only 30 participants restricts the depth of analysis that can be conducted.
O (Original): The data set is not original. The underlying data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.
C (Comprehensive): The data set contains no demographic information about the users. If the sample is biased, the insights will therefore not be representative of all types of users. In conclusion, the data is not comprehensive.
C (Current): The data is not up to date. It was last updated 3 years ago and therefore does not reflect current trends in the smart device market.
C (Cited): As mentioned above, the data was collected through Amazon Mechanical Turk, but its credibility remains uncertain because there is no information about the reliability of the source.
In summary, the current dataset lacks sufficient data integrity and credibility. While it provides some insights into the business problem, obtaining more records and conducting further analysis is necessary to ensure reliability and reduce bias.
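To give a rough sense of why 30 participants is a small sample, the sketch below computes the worst-case 95% margin of error for an estimated proportion. It assumes simple random sampling, which this volunteer sample does not satisfy, so the result is only an optimistic illustration. `margin_of_error` is a helper name introduced here and is not part of the original analysis.

```r
# Worst-case 95% margin of error for a proportion (p = 0.5).
# Assumption: simple random sampling, which does NOT hold for this data set.
margin_of_error <- function(n, p = 0.5, z = qnorm(0.975)) {
  z * sqrt(p * (1 - p) / n)
}

# Margin of error with the 30 participants described above
moe_30 <- margin_of_error(30)
round(moe_30, 3)
```

Under these assumptions the margin is close to 18 percentage points, which illustrates why conclusions drawn from this sample should be treated as indicative only.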
Licensing: CC0 (Public Domain)
Of the 18 data files in the ‘Fitabase Data 4.12.16-5.12.16’ folder, the following data sets are used for further analysis:
dailyActivity_merged: 15 columns and 940 rows
heartrate_seconds_merged: 3 columns and 2,483,658 rows
sleepDay_merged: 5 columns and 413 rows
weightLogInfo_merged: 8 columns and 67 rows
minuteMETsNarrow_merged: 3 columns and 1,325,580 rows
We will begin processing the data using MS Excel, Tableau, and RStudio.
library(tidyverse)
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(dplyr)
library(tidyr)
#Reading csv files
daily_activity <- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heart_rate_seconds <- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
METs_narrow_minutes <- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/minuteMETsNarrow_merged.csv")
## Rows: 1325580 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityMinute
## dbl (2): Id, METs
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_day <- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_info <- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
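The five `read_csv()` calls above repeat the same absolute path and print a column specification each time. A minimal sketch of one way to keep the folder in a single place and quiet those messages follows. `data_dir`, `read_fitbit`, and `daily_activity_demo` are hypothetical names, and the tiny file written to a temporary folder is only a stand-in so that the sketch runs anywhere.

```r
library(readr)

# Hypothetical base directory; in this study it would be the folder
# that holds the Fitabase CSV files.
data_dir <- tempdir()

# Write a tiny stand-in file (two rows copied from the daily activity preview).
write_csv(
  data.frame(
    Id = c(1503960366, 1503960366),
    ActivityDate = c("4/12/2016", "4/13/2016"),
    TotalSteps = c(13162, 10735)
  ),
  file.path(data_dir, "dailyActivity_merged.csv")
)

# Hypothetical helper: builds the full path and suppresses the
# column-specification message via show_col_types = FALSE.
read_fitbit <- function(file_name) {
  read_csv(file.path(data_dir, file_name), show_col_types = FALSE)
}

daily_activity_demo <- read_fitbit("dailyActivity_merged.csv")
dim(daily_activity_demo)
```

With this layout, moving the data to another machine would only require changing `data_dir`.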
head(daily_activity)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
head(heart_rate_seconds)
## # A tibble: 6 × 3
## Id Time Value
## <dbl> <chr> <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM 97
## 2 2022484408 4/12/2016 7:21:05 AM 102
## 3 2022484408 4/12/2016 7:21:10 AM 105
## 4 2022484408 4/12/2016 7:21:20 AM 103
## 5 2022484408 4/12/2016 7:21:25 AM 101
## 6 2022484408 4/12/2016 7:22:05 AM 95
head(METs_narrow_minutes)
## # A tibble: 6 × 3
## Id ActivityMinute METs
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 10
## 2 1503960366 4/12/2016 12:01:00 AM 10
## 3 1503960366 4/12/2016 12:02:00 AM 10
## 4 1503960366 4/12/2016 12:03:00 AM 10
## 5 1503960366 4/12/2016 12:04:00 AM 10
## 6 1503960366 4/12/2016 12:05:00 AM 12
head(sleep_day)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:0… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
head(weight_info)
## # A tibble: 6 × 8
## Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 … 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 … 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016… 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016… 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016… 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016… 72.4 160. 25 27.5 TRUE 1.46e12
In all of the tables above except daily_activity, the date and time are stored together in a single column. To simplify the analysis, the date and time are split into separate columns using the separate() function.
### Separating Date and Time columns
Heart Rate Table
heart_rate <- heart_rate_seconds %>%
separate(Time, c("Date","Time")," ")
## Warning: Expected 2 pieces. Additional pieces discarded in 2483658 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
METs Narrow Table
MET_narrow <- METs_narrow_minutes %>%
separate(ActivityMinute,c("Date", "Time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 1325580 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Sleep Day Table
sleep_day_new <- sleep_day %>%
separate(SleepDay,c("Date", "Time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Weight Info Log Table
weight_info_new <- weight_info %>%
separate(Date,c("Date", "Time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4, 5, 6, 7,
## 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
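The "Additional pieces discarded" warnings appear because each timestamp has three space-separated pieces (date, time, and AM/PM) while only two column names are supplied, so the AM/PM marker is dropped. The sketch below shows two ways the marker could be kept. It is an alternative under stated assumptions, not the method used in this analysis: the first two rows are copied from the heart rate preview, the third (PM) row is invented for illustration, and `demo`, `demo_split`, and `demo_parsed` are hypothetical names.

```r
library(tidyr)
library(lubridate)

# Rows 1-2 come from the heart rate preview; row 3 is a made-up PM reading.
demo <- data.frame(
  Id = c(2022484408, 2022484408, 2022484408),
  Time = c("4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM", "4/12/2016 7:21:00 PM"),
  Value = c(97, 102, 97)
)

# Option 1: extra = "merge" splits only at the first space,
# so the Time column keeps values such as "7:21:00 AM".
demo_split <- separate(demo, Time, into = c("Date", "Time"),
                       sep = " ", extra = "merge")

# Option 2: parse the full timestamp, then derive the calendar date from it.
demo_parsed <- demo
demo_parsed$DateTime <- mdy_hms(demo_parsed$Time)
demo_parsed$Date <- as_date(demo_parsed$DateTime)
```

Keeping the marker matters whenever the time of day is used later, because otherwise morning and evening readings with the same clock time cannot be told apart.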
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(heart_rate)
## [1] "Id" "Date" "Time" "Value"
colnames(MET_narrow)
## [1] "Id" "Date" "Time" "METs"
colnames(sleep_day_new)
## [1] "Id" "Date" "Time"
## [4] "TotalSleepRecords" "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weight_info_new)
## [1] "Id" "Date" "Time" "WeightKg"
## [5] "WeightPounds" "Fat" "BMI" "IsManualReport"
## [9] "LogId"
nrow(daily_activity)
## [1] 940
nrow(heart_rate)
## [1] 2483658
nrow(MET_narrow)
## [1] 1325580
nrow(sleep_day_new)
## [1] 413
nrow(weight_info_new)
## [1] 67
nrow(daily_activity[duplicated(daily_activity),])
## [1] 0
nrow(heart_rate[duplicated(heart_rate),])
## [1] 9334
nrow(MET_narrow[duplicated(MET_narrow),])
## [1] 289582
nrow(sleep_day_new[duplicated(sleep_day_new),])
## [1] 3
nrow(weight_info_new[duplicated(weight_info_new),])
## [1] 0
heart_rate <- unique(heart_rate)
nrow(heart_rate)
## [1] 2474324
MET_narrow <- unique(MET_narrow)
nrow(MET_narrow)
## [1] 1035998
sleep_day_new <- unique(sleep_day_new)
nrow(sleep_day_new)
## [1] 410
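Because the time column no longer carries AM/PM after the split, a morning and an evening reading with the same clock time and value look identical, so some of the duplicates counted above may not be true repeats. The toy example below illustrates the effect. The data is invented and the object names are hypothetical.

```r
library(dplyr)

# Invented readings: rows 1 and 3 are a true repeat, row 2 is a PM reading.
readings <- tibble(
  Id = c(1, 1, 1),
  Date = c("4/12/2016", "4/12/2016", "4/12/2016"),
  Time = c("7:21:00 AM", "7:21:00 PM", "7:21:00 AM"),
  Value = c(97, 97, 97)
)

# With AM/PM kept, only the third row repeats an earlier row.
true_dupes <- sum(duplicated(readings))

# With AM/PM stripped, the PM row also looks like a repeat.
stripped <- mutate(readings, Time = sub(" [AP]M$", "", Time))
apparent_dupes <- sum(duplicated(stripped))

# distinct() is the dplyr counterpart of unique() for data frames.
deduped <- distinct(readings)
```

Comparing `true_dupes` with `apparent_dupes` on the real tables would show how many of the removed rows were genuine repeats.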
Records with missing or “0” values have been excluded to mitigate data skewness during analysis.
daily_activity <- daily_activity %>% filter(TotalSteps !=0)
daily_activity <- daily_activity %>% filter(TotalDistance !=0)
78 records were removed from the daily_activity table because their total steps and total distance were missing (recorded as 0).
MET_narrow <- MET_narrow %>% filter(METs!=0)
7 records were removed from the MET_narrow table because their METs were missing (recorded as 0).
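The removed-record counts can be derived directly by comparing row counts before and after filtering. A minimal sketch on an invented miniature table follows; `daily_demo`, `daily_clean`, and `rows_removed` are hypothetical names.

```r
library(dplyr)

# Invented miniature of the daily activity table.
daily_demo <- tibble(
  Id = c(1, 1, 2, 2),
  TotalSteps = c(13162, 0, 10735, 0),
  TotalDistance = c(8.5, 0, 6.97, 0)
)

rows_before <- nrow(daily_demo)

# Comma-separated conditions inside filter() are combined with AND,
# so one call replaces the two consecutive filters.
daily_clean <- daily_demo %>%
  filter(TotalSteps != 0, TotalDistance != 0)

# Number of records dropped by the filter
rows_removed <- rows_before - nrow(daily_clean)
```

Note that `filter()` also drops rows where a condition evaluates to `NA`, so true missing values would be removed by the same step.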
### Identifying distinct IDs in each table
n_distinct(daily_activity$Id) #Number of distinct users in Daily Activity data
## [1] 33
n_distinct(heart_rate$Id) #Number of distinct users in Heart rate data
## [1] 14
n_distinct(MET_narrow$Id) #Number of distinct users in MET_Narrow data
## [1] 33
n_distinct(sleep_day_new$Id) #Number of distinct users in Sleep data
## [1] 24
n_distinct(weight_info_new$Id) #Number of distinct users in Weight Info Log data
## [1] 8
The counts above show that the daily activity data set and the MET narrow data set contain the same number of unique users (33). To facilitate further analysis, these two data sets are merged.
Daily_activity and MET_narrow
daily_activity <- daily_activity %>%
rename(Date = ActivityDate)
activity <- merge(daily_activity, MET_narrow, by = c("Id", "Date"))
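`merge()` with `by = c("Id", "Date")` keeps only the user-day combinations present in both tables. Since MET_narrow holds one row per minute, each daily row is repeated once for every matching minute, and summaries of the daily columns are then weighted by the number of minute rows. The sketch below contrasts this with an alternative that first collapses METs to one value per user and day. It uses invented data and hypothetical names (`mets_daily`, `MeanMETs`) and is not the approach taken in this analysis.

```r
library(dplyr)

# Invented daily table: one row per user and day.
daily_demo <- tibble(
  Id = c(1, 1),
  Date = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735)
)

# Invented minute-level table: two minutes per day.
mets_demo <- tibble(
  Id = c(1, 1, 1, 1),
  Date = c("4/12/2016", "4/12/2016", "4/13/2016", "4/13/2016"),
  METs = c(10, 12, 10, 10)
)

# Direct join: every daily row is repeated once per minute row.
minute_level <- inner_join(daily_demo, mets_demo, by = c("Id", "Date"))

# Alternative: aggregate METs to one row per user and day, then join.
mets_daily <- mets_demo %>%
  group_by(Id, Date) %>%
  summarise(MeanMETs = mean(METs), .groups = "drop")

day_level <- inner_join(daily_demo, mets_daily, by = c("Id", "Date"))
```

Which shape is appropriate depends on the question: minute-level rows suit intensity analysis, while day-level rows keep statistics such as mean steps per day unweighted.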
At this stage, the ETL (extract, transform, load) process has been performed on the data sets. Next, we summarize each data set and extract valuable insights.
activity %>%
select(TotalSteps,
TotalDistance,
VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes,
SedentaryMinutes,
Calories, METs) %>%
summary()
## TotalSteps TotalDistance VeryActiveMinutes FairlyActiveMinutes
## Min. : 8 Min. : 0.010 Min. : 0.00 Min. : 0.00
## 1st Qu.: 5267 1st Qu.: 3.620 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 8538 Median : 6.030 Median : 9.00 Median : 9.00
## Mean : 8761 Mean : 6.307 Mean : 24.86 Mean : 15.67
## 3rd Qu.:11423 3rd Qu.: 8.160 3rd Qu.: 38.00 3rd Qu.: 22.00
## Max. :36019 Max. :28.030 Max. :210.00 Max. :143.00
## LightlyActiveMinutes SedentaryMinutes Calories METs
## Min. : 0.0 Min. : 0 Min. : 257 Min. : 6.00
## 1st Qu.:156.0 1st Qu.: 716 1st Qu.:1899 1st Qu.: 10.00
## Median :217.0 Median : 981 Median :2275 Median : 10.00
## Mean :219.9 Mean : 940 Mean :2422 Mean : 16.27
## 3rd Qu.:279.0 3rd Qu.:1167 3rd Qu.:2889 3rd Qu.: 13.00
## Max. :518.0 Max. :1440 Max. :4900 Max. :157.00
From the above summary, the following observations can be drawn:
heart_rate %>%
select(Value) %>%
summary()
## Value
## Min. : 36.00
## 1st Qu.: 63.00
## Median : 73.00
## Mean : 77.36
## 3rd Qu.: 88.00
## Max. :203.00
Observations:
The maximum heart rate of 203 beats per minute (readings are recorded roughly every 5 seconds) is far above the third quartile of 88 beats per minute. This suggests either high-intensity physical activity among some users or the presence of outliers in the data.
A normal resting heart rate for adults ranges from 60 to 100 beats per minute. The mean of about 77 beats per minute across all readings falls within this range, which is consistent with a relatively healthy heart rate, although the readings are not limited to periods of rest. Individual variation and factors such as age, fitness level, and health conditions also influence heart rate measurements.
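One common way to make the outlier question concrete is the 1.5 × IQR rule of thumb, sketched here with the quartiles from the summary output above. The rule is an assumption introduced for illustration; the original analysis does not state which outlier criterion it has in mind.

```r
# Quartiles and maximum taken from the heart rate summary above.
q1 <- 63
q3 <- 88
max_hr <- 203

# Interquartile range
iqr <- q3 - q1

# Rule of thumb: values more than 1.5 * IQR beyond the quartiles are flagged.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Is the maximum reading beyond the upper fence?
is_flagged <- max_hr > upper_fence
```

A flagged value is not necessarily an error, since heart rates during exercise routinely exceed such a fence, so flagged readings would still need to be checked against the activity data.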
sleep_day_new %>%
select(TotalMinutesAsleep,TotalTimeInBed) %>%
summary()
## TotalMinutesAsleep TotalTimeInBed
## Min. : 58.0 Min. : 61.0
## 1st Qu.:361.0 1st Qu.:403.8
## Median :432.5 Median :463.0
## Mean :419.2 Mean :458.5
## 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :796.0 Max. :961.0
Observations:
On average, users sleep for 419 minutes (about 7 hours) out of 458 minutes (about 7.6 hours) in bed, so roughly 39 minutes in bed per night are spent awake.
The average sleep duration is approximately 7 hours, aligning with typical sleep requirements.
However, the maximum values are about 13 hours asleep and 16 hours in bed, well above the third quartiles of roughly 8 to 9 hours, which may indicate low physical activity and excessive time spent sleeping for some users.
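The arithmetic behind these sleep figures can be reproduced from the means in the summary output. The ratio below is computed from the two means, which is only an approximation of the average per-night ratio; `sleep_efficiency` is a name introduced here for illustration.

```r
# Means taken from the sleep summary above (in minutes).
mean_asleep <- 419.2
mean_in_bed <- 458.5

# Average minutes spent in bed but not asleep
awake_in_bed <- mean_in_bed - mean_asleep

# Share of time in bed spent asleep (ratio of means, an approximation)
sleep_efficiency <- mean_asleep / mean_in_bed

# The same durations expressed in hours
hours_asleep <- mean_asleep / 60
hours_in_bed <- mean_in_bed / 60

round(c(awake_in_bed, sleep_efficiency, hours_asleep, hours_in_bed), 2)
```

A per-night ratio computed as `TotalMinutesAsleep / TotalTimeInBed` on the full table would give a more faithful picture of how this share varies between users.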
weight_info_new %>%
select(WeightKg,BMI) %>%
summary()
## WeightKg BMI
## Min. : 52.60 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:23.96
## Median : 62.50 Median :24.39
## Mean : 72.04 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:25.56
## Max. :133.50 Max. :47.54
Observations:
1. The mean BMI (25.19) is higher than the median (24.39) and close to the third quartile (25.56), which suggests a right-skewed distribution rather than a symmetric one. The maximum BMI of 47.54 is an extreme value and a likely outlier.
2. Most weight records fall within the normal (18.5-24.9) or overweight (25-29.9) BMI categories, indicating potential health risks for some individuals.
3. With an average weight of 72 kg, interpretation is limited without considering factors such as age and gender, which significantly influence weight distribution and health assessments.
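The BMI categories mentioned above can be assigned programmatically with `cut()`. The sketch applies the normal and overweight ranges quoted in the observations to the six summary statistics. The "Underweight" and "Obese" labels and their cut-offs (below 18.5 and 30 or above) are the standard complements of those ranges and are added here as an assumption.

```r
# BMI summary statistics from the output above (min, Q1, median, mean, Q3, max).
bmi_values <- c(21.45, 23.96, 24.39, 25.19, 25.56, 47.54)

# right = FALSE makes each interval closed on the left, e.g. [18.5, 25).
bmi_category <- cut(
  bmi_values,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right = FALSE
)

# Count of values per category
table(bmi_category)
```

Applied to the `BMI` column of the weight table, the same call would give the share of records in each category instead of relying on the summary statistics alone.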
1. The analysis of smart device data provides valuable insights into user activity levels, sleep patterns, heart rate, and BMI distribution.
2. Bellabeat’s products, in particular health and fitness tracking, could address a wide range of consumer needs and preferences.
3. Product development and marketing strategies can be informed by the most important trends, such as the high prevalence of normal BMI and the large amount of sedentary time.
4. The depth of the analysis is limited by incomplete information, in particular on body fat, gender, and age, so further data collection is needed.
Product Improvement: Develop features that encourage users to increase physical activity and improve sleep quality in line with the goal of promoting healthier lifestyles.
Targeted Marketing: Adapt marketing campaigns to the specific needs of users in different BMI categories, highlighting the benefits of Bellabeat products in supporting overall wellness.
Data Collection: Prioritize the collection of comprehensive user data, including body fat percentage and demographic data, to improve the accuracy and depth of smart device analysis.
Partnerships and Collaborations: Explore partnerships with health and wellness organizations to leverage their healthy-lifestyle expertise and expand Bellabeat’s market reach.
Customer Education: Provide educational resources and content that help users understand metrics such as BMI, heart rate, and activity levels, so that they can make informed decisions about health and fitness.