Bellabeat Case Study

Bellabeat, founded by Urška Sršen and Sando Mur, has created health-focused smart products since 2013 that give women data on their activity, sleep, stress, and reproductive health. Having expanded globally by 2016, the company emphasizes digital marketing through Google, Facebook, Instagram, Twitter, and YouTube. Sršen wants to use smart device usage data to gain strategic marketing insights.

About Project

This capstone project is an important part of the Google Professional Data Analyst Certification course. In this project, we explore a data set for Bellabeat, a high-tech manufacturer of health-focused products for women.

As a junior analyst on Bellabeat’s marketing analytics team, I will focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers use their smart devices, and then provide high-level recommendations for Bellabeat’s marketing strategy.

Phase 1: Ask

Business Problem

In this phase, I will analyze trends in smart device usage data to gain insights into how consumers utilize non-Bellabeat smart devices. This analysis will inform the development of Bellabeat’s marketing strategy.

Key Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: Data analyst team that helps guide Bellabeat’s marketing strategy

Deliverable

Clear summary of the business task.
Description of the data source.
ETL (Extract, Transform, and Load) process on the data.
Clear summary of data analysis.
Supporting data visualizations.
Recommendations based on the insights.

Phase 2: Prepare

Dataset: As requested by the company’s stakeholders, the Fitbit Fitness Tracker Data set from Kaggle, a third-party open data platform, is used. It contains personal tracker data from 30 users who agreed to share it, including minute-level output for physical activity, heart rate, and sleep monitoring. It also includes daily activity, steps, and heart rate information that can be used to explore users’ habits.
Storage: The data is downloaded from Kaggle and stored on a local drive on macOS.
Data Organization: The data is organised in long format, that is, each row holds one observation, so the tables have many more rows than columns.
Bias/Credibility Issues: The ROCCC criteria are applied below to assess the bias and credibility of the data set.

R (Reliable): The reliability of this data is questionable due to the absence of information regarding the margin of error. Additionally, the small sample size of only 30 participants restricts the depth of analysis that can be conducted.
O (Originality): The data set is not original. The original data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.
C (Comprehensiveness): The data set contains no demographic information about the users, so any bias in the sample cannot be checked and the insights may not hold for all types of users. In conclusion, the data is not comprehensive.
C (Current): The data is not up to date. It was last updated 3 years ago, so it does not reflect current trends in the smart device market.
C (Cited): As previously mentioned, the data set was collected via Amazon Mechanical Turk, but its credibility remains uncertain because there is no information about the source’s reliability.

In summary, the current dataset lacks sufficient data integrity and credibility. While it provides some insights into the business problem, obtaining more records and conducting further analysis is necessary to ensure reliability and reduce bias.

Licensing: CC0 (Public Domain)

Data Sorting & Filtering

Out of the 18 data files in the folder ‘Fitabase Data 4.12.16-5.12.16’, the following data sets are used for further analysis:

dailyActivity_merged: 15 Columns and 940 Rows
heartrate_seconds_merged: 3 Columns and 2,483,658 Rows
sleepDay_merged: 5 Columns and 413 Rows
weightLogInfo_merged: 8 Columns and 67 Rows
minuteMETsNarrow_merged: 3 Columns and 1,325,580 Rows

Date and Time columns will be created separately for easy analysis.
Data formats are checked before proceeding further.
Files will be merged on the basis of ID and Date.

Phase 3: Process

We will process the data using MS Excel, Tableau, and RStudio.

Loading Packages

library(tidyverse)

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(ggplot2)
library(dplyr)
library(tidyr)

Importing Data Sets

#Reading csv files

daily_activity <- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

heart_rate_seconds <- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")

## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

METs_narrow_minutes <- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/minuteMETsNarrow_merged.csv")

## Rows: 1325580 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityMinute
## dbl (2): Id, METs
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sleep_day<- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

weight_info <- read_csv("/Users/ridhimabansal/Downloads/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Verifying & Analyzing Datasets

head(daily_activity)

## # A tibble: 6 × 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>
## 1 1503960366 4/12/2016         13162          8.5             8.5 
## 2 1503960366 4/13/2016         10735          6.97            6.97
## 3 1503960366 4/14/2016         10460          6.74            6.74
## 4 1503960366 4/15/2016          9762          6.28            6.28
## 5 1503960366 4/16/2016         12669          8.16            8.16
## 6 1503960366 4/17/2016          9705          6.48            6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>

head(heart_rate_seconds)

## # A tibble: 6 × 3
##           Id Time                 Value
##        <dbl> <chr>                <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM    97
## 2 2022484408 4/12/2016 7:21:05 AM   102
## 3 2022484408 4/12/2016 7:21:10 AM   105
## 4 2022484408 4/12/2016 7:21:20 AM   103
## 5 2022484408 4/12/2016 7:21:25 AM   101
## 6 2022484408 4/12/2016 7:22:05 AM    95

head(METs_narrow_minutes)

## # A tibble: 6 × 3
##           Id ActivityMinute         METs
##        <dbl> <chr>                 <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM    10
## 2 1503960366 4/12/2016 12:01:00 AM    10
## 3 1503960366 4/12/2016 12:02:00 AM    10
## 4 1503960366 4/12/2016 12:03:00 AM    10
## 5 1503960366 4/12/2016 12:04:00 AM    10
## 6 1503960366 4/12/2016 12:05:00 AM    12

head(sleep_day)

## # A tibble: 6 × 5
##           Id SleepDay        TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                       <dbl>              <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:0…                 1                327            346
## 2 1503960366 4/13/2016 12:0…                 2                384            407
## 3 1503960366 4/15/2016 12:0…                 1                412            442
## 4 1503960366 4/16/2016 12:0…                 2                340            367
## 5 1503960366 4/17/2016 12:0…                 1                700            712
## 6 1503960366 4/19/2016 12:0…                 1                304            320

head(weight_info)

## # A tibble: 6 × 8
##           Id Date       WeightKg WeightPounds   Fat   BMI IsManualReport   LogId
##        <dbl> <chr>         <dbl>        <dbl> <dbl> <dbl> <lgl>            <dbl>
## 1 1503960366 5/2/2016 …     52.6         116.    22  22.6 TRUE           1.46e12
## 2 1503960366 5/3/2016 …     52.6         116.    NA  22.6 TRUE           1.46e12
## 3 1927972279 4/13/2016…    134.          294.    NA  47.5 FALSE          1.46e12
## 4 2873212765 4/21/2016…     56.7         125.    NA  21.5 TRUE           1.46e12
## 5 2873212765 5/12/2016…     57.3         126.    NA  21.7 TRUE           1.46e12
## 6 4319703577 4/17/2016…     72.4         160.    25  27.5 TRUE           1.46e12

From the above tables, it can be observed that every table except daily_activity has a combined date and time column. To simplify the analysis, date and time will be split into separate columns using the separate() function.

Separating Date and Time columns

Heart Rate Table

heart_rate <- heart_rate_seconds %>%
   separate(Time, c("Date","Time")," ")

## Warning: Expected 2 pieces. Additional pieces discarded in 2483658 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

METs Narrow Table

MET_narrow <- METs_narrow_minutes %>% 
  separate(ActivityMinute,c("Date", "Time"), " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 1325580 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

Sleep Day Table

sleep_day_new <- sleep_day %>% 
  separate(SleepDay,c("Date", "Time"), " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

Weight Info Log Table

weight_info_new <- weight_info %>% 
  separate(Date,c("Date", "Time"), " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4, 5, 6, 7,
## 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
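
The warnings above report that a third piece was discarded in every row. Given the timestamp layout shown in the head() output, that piece is the AM/PM marker, so morning and evening readings end up with the same Time value. The following is a minimal sketch, on hypothetical toy rows, of one way to keep the marker: the `extra = "merge"` argument of separate(), followed by lubridate's mdy_hms() to obtain a real date-time. It is an alternative to consider, not the method used in this report, and applying it would change the duplicate counts reported later.

```r
library(tidyr)
library(lubridate)

# Toy rows in the same "m/d/Y h:m:s AM/PM" layout as the exports (values are made up)
toy <- data.frame(
  Id = c(1, 1),
  Time = c("4/12/2016 7:21:00 AM", "4/12/2016 7:21:00 PM"),
  Value = c(97, 102),
  stringsAsFactors = FALSE
)

# extra = "merge" splits at the first space only, so "7:21:00 AM" stays intact
toy_split <- separate(toy, Time, c("Date", "Time"), sep = " ", extra = "merge")

# Parse date and time together so that sorting and filtering are chronological
toy_split$Timestamp <- mdy_hms(paste(toy_split$Date, toy_split$Time))

print(toy_split)
```

With the marker preserved, the two toy rows remain distinct observations instead of collapsing to the same Date and Time.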

Column Names for each Table

colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

colnames(heart_rate)

## [1] "Id"    "Date"  "Time"  "Value"

colnames(MET_narrow)

## [1] "Id"   "Date" "Time" "METs"

colnames(sleep_day_new)

## [1] "Id"                 "Date"               "Time"              
## [4] "TotalSleepRecords"  "TotalMinutesAsleep" "TotalTimeInBed"

colnames(weight_info_new)

## [1] "Id"             "Date"           "Time"           "WeightKg"      
## [5] "WeightPounds"   "Fat"            "BMI"            "IsManualReport"
## [9] "LogId"

Identifying number of rows in each table

nrow(daily_activity)

## [1] 940

nrow(heart_rate)

## [1] 2483658

nrow(MET_narrow)

## [1] 1325580

nrow(sleep_day_new)

## [1] 413

nrow(weight_info_new)

## [1] 67

Identifying Duplicacy

nrow(daily_activity[duplicated(daily_activity),])

## [1] 0

nrow(heart_rate[duplicated(heart_rate),])

## [1] 9334

nrow(MET_narrow[duplicated(MET_narrow),])

## [1] 289582

nrow(sleep_day_new[duplicated(sleep_day_new),])

## [1] 3

nrow(weight_info_new[duplicated(weight_info_new),])

## [1] 0

Removing duplicate rows

heart_rate <- unique(heart_rate)
nrow(heart_rate)

## [1] 2474324

MET_narrow <- unique(MET_narrow)
nrow(MET_narrow)

## [1] 1035998

sleep_day_new <- unique(sleep_day_new)
nrow(sleep_day_new)

## [1] 410
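
As a compact illustration of the two steps above (counting duplicates, then dropping them), here is a small sketch on a hypothetical toy table. It uses `sum(duplicated())` for the count and dplyr's `distinct()` as an equivalent to `unique()` for data frames; the toy values are assumptions for demonstration only.

```r
library(dplyr)

# Toy table with one exact duplicate row (hypothetical values)
toy <- tibble(
  Id = c(1, 1, 2),
  Date = c("4/12/2016", "4/12/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 410)
)

n_dupes <- sum(duplicated(toy))  # rows that repeat an earlier row
toy_clean <- distinct(toy)       # keeps the first copy of each distinct row

print(n_dupes)
print(nrow(toy_clean))
```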

Removing Missing Records

Records with missing or “0” values have been excluded to mitigate data skewness during analysis.

daily_activity <- daily_activity %>% filter(TotalSteps !=0)
daily_activity <- daily_activity %>% filter(TotalDistance !=0)

From the daily_activity table, 78 records have been removed where total steps or total distance were missing (value is 0).

MET_narrow <- MET_narrow %>% filter(METs!=0)

From the MET_narrow table, 7 records have been removed where METs were missing (value is 0).

Identifying distinct IDs in each table
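
The removed-record counts quoted above can be checked by comparing row counts before and after filtering. The sketch below shows the pattern on a hypothetical toy table; the table name and values are assumptions, and only the column names come from the data set.

```r
library(dplyr)

# Toy daily table; two rows carry 0 as a stand-in for "not recorded"
toy_activity <- tibble(
  Id = c(1, 1, 2, 2),
  TotalSteps = c(13162, 0, 9762, 0),
  TotalDistance = c(8.5, 0, 6.28, 0)
)

rows_before <- nrow(toy_activity)

# Both conditions in one filter() call are combined with AND
toy_filtered <- toy_activity %>%
  filter(TotalSteps != 0, TotalDistance != 0)

rows_removed <- rows_before - nrow(toy_filtered)
print(rows_removed)
```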

n_distinct(daily_activity$Id) #Number of distinct users in Daily Activity data

## [1] 33

n_distinct(heart_rate$Id)  #Number of distinct users in Heart rate data

## [1] 14

n_distinct(MET_narrow$Id)  #Number of distinct users in MET_Narrow data

## [1] 33

n_distinct(sleep_day_new$Id) #Number of distinct users in Sleep data

## [1] 24

n_distinct(weight_info_new$Id) #Number of distinct users in Weight Info Log data

## [1] 8

Observing the tables above, it’s evident that both the Activity dataset and the MET narrow data set contain an equal number of unique users. Consequently, to facilitate further analysis, these data sets have been merged.

Joining `Daily_activity` and `MET_narrow`

daily_activity <- daily_activity %>% 
  rename(Date = ActivityDate)

activity <- merge(daily_activity, MET_narrow, by = c("Id", "Date"))
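
One property of this join is worth keeping in mind: daily_activity has one row per user per day, while MET_narrow has one row per user per minute, so each daily row is repeated once for every matching minute. A possible alternative, sketched below on hypothetical toy tables, is to aggregate METs to one value per user per day before joining. The names `toy_daily`, `toy_mets`, and `MeanMETs` are illustrative assumptions, and this is not the approach used for the summaries in this report.

```r
library(dplyr)

# Toy daily table: one row per user per day (hypothetical values)
toy_daily <- tibble(
  Id = c(1, 1),
  Date = c("4/12/2016", "4/13/2016"),
  Calories = c(1985, 1797)
)

# Toy minute-level table: several rows per user per day
toy_mets <- tibble(
  Id = c(1, 1, 1, 1),
  Date = c("4/12/2016", "4/12/2016", "4/13/2016", "4/13/2016"),
  METs = c(10, 12, 10, 30)
)

# Direct join: each daily row is repeated for every matching minute row
direct <- merge(toy_daily, toy_mets, by = c("Id", "Date"))

# Aggregate first, then join: one row per user per day
daily_mets <- toy_mets %>%
  group_by(Id, Date) %>%
  summarise(MeanMETs = mean(METs), .groups = "drop")

per_day <- inner_join(toy_daily, daily_mets, by = c("Id", "Date"))

print(nrow(direct))
print(per_day)
```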

Phase 4: Analyze

At this stage, the ETL (Extract, Transform, and Load) process has been performed on the data sets. Next, we summarize each data set and extract valuable insights.

activity %>% 
  select(TotalSteps,
         TotalDistance,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes,
         Calories, METs) %>% 
  summary()

##    TotalSteps    TotalDistance    VeryActiveMinutes FairlyActiveMinutes
##  Min.   :    8   Min.   : 0.010   Min.   :  0.00    Min.   :  0.00     
##  1st Qu.: 5267   1st Qu.: 3.620   1st Qu.:  0.00    1st Qu.:  0.00     
##  Median : 8538   Median : 6.030   Median :  9.00    Median :  9.00     
##  Mean   : 8761   Mean   : 6.307   Mean   : 24.86    Mean   : 15.67     
##  3rd Qu.:11423   3rd Qu.: 8.160   3rd Qu.: 38.00    3rd Qu.: 22.00     
##  Max.   :36019   Max.   :28.030   Max.   :210.00    Max.   :143.00     
##  LightlyActiveMinutes SedentaryMinutes    Calories         METs       
##  Min.   :  0.0        Min.   :   0     Min.   : 257   Min.   :  6.00  
##  1st Qu.:156.0        1st Qu.: 716     1st Qu.:1899   1st Qu.: 10.00  
##  Median :217.0        Median : 981     Median :2275   Median : 10.00  
##  Mean   :219.9        Mean   : 940     Mean   :2422   Mean   : 16.27  
##  3rd Qu.:279.0        3rd Qu.:1167     3rd Qu.:2889   3rd Qu.: 13.00  
##  Max.   :518.0        Max.   :1440     Max.   :4900   Max.   :157.00

From the above summary, the following observations can be drawn:

Activity Levels

Active minutes indicate how active users were throughout the day. The median for both very active and fairly active minutes is about 9 minutes per day, which is very low.
The majority of activity is light, with a mean of 219.9 lightly active minutes per day.

Intensity of Activity

A MET (metabolic equivalent of task) expresses energy use relative to the energy used while sitting quietly. The mean of 16.27 suggests a moderate level of overall physical activity, as it falls between sedentary behavior and vigorous activity.

Sedentary Behavior

Sedentary minutes refer to the duration of time spent engaging in activities with very low levels of physical movement or energy expenditure, such as sitting, reclining, or lying down.
Users spend a considerable part of the day on low-level physical activity, with a mean of 940 sedentary minutes (over 15 hours) per day. Many of the users may be working professionals who spend most of their time seated.

Calories Expenditure

On average, users burned 2,422 calories per day. Given the predominantly sedentary behavior observed, it can be inferred that a significant portion of these calories is expended during non-physical activities.

heart_rate %>% 
  select(Value) %>% 
  summary()

##      Value       
##  Min.   : 36.00  
##  1st Qu.: 63.00  
##  Median : 73.00  
##  Mean   : 77.36  
##  3rd Qu.: 88.00  
##  Max.   :203.00

Observations:

The maximum heart rate of 203 beats per minute (readings are logged roughly every 5 seconds) far exceeds the third quartile of 88 beats per minute. This suggests either high-intensity physical activity among some users or the presence of outliers in the data.
A normal resting heart rate for adults ranges from 60 to 100 beats per minute. The mean reading of 77 beats per minute falls within that range, although it averages all readings across the day rather than resting measurements only. Individual factors such as age, fitness level, and health conditions also influence heart rate measurements.

sleep_day_new %>% 
  select(TotalMinutesAsleep,TotalTimeInBed) %>% 
  summary()

##  TotalMinutesAsleep TotalTimeInBed 
##  Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:361.0      1st Qu.:403.8  
##  Median :432.5      Median :463.0  
##  Mean   :419.2      Mean   :458.5  
##  3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :796.0      Max.   :961.0

Observations

On average, users sleep for 419 minutes (about 7 hours) out of 458 minutes (about 7.6 hours) in bed, so roughly 40 minutes per night are spent in bed awake.
The average sleep duration of approximately 7 hours aligns with typical sleep requirements.
However, the maximum values reach 13 to 16 hours (796 minutes asleep and 961 minutes in bed), well above the third-quartile values of roughly 8 to 9 hours, which may point to low physical activity levels and excessive time spent sleeping for some users.
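
The relationship between the two means can be made explicit with a short calculation. The sketch below uses only the two mean values from the summary output; "sleep efficiency" is a label introduced here for the ratio of time asleep to time in bed, and because it is a ratio of means it is only an approximation of the per-night average.

```r
# Mean values taken from the summary() output above
mean_asleep <- 419.2   # TotalMinutesAsleep
mean_in_bed <- 458.5   # TotalTimeInBed

# Minutes spent in bed but not asleep, on average
awake_in_bed <- mean_in_bed - mean_asleep

# Share of time in bed that is spent asleep (ratio of the two means)
sleep_efficiency <- mean_asleep / mean_in_bed

print(awake_in_bed)
print(round(sleep_efficiency, 3))
```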

weight_info_new %>% 
  select(WeightKg,BMI) %>% 
  summary()

##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

Observations

The mean BMI (25.19) sits above the median (24.39) and close to the third quartile (25.56), which suggests a right-skewed distribution rather than a symmetric one. The maximum BMI of 47.54 is an outlier that pulls the mean upward.

Most users fall within the normal (18.5-24.9) or overweight (25-29.9) BMI categories, indicating potential health risks for some individuals.
With an average weight of 72 kg, interpretation is limited without considering factors such as age and gender, which significantly influence weight distribution and health assessments.

Phase 5: Share

During this phase, we’ll generate visualizations that are accessible and comprehensible to our audience, stakeholders and marketing team. These visualizations aim to present insights in an interactive and user-friendly manner. The analysis will be shared with the audience via open platforms in R Markdown format for easy access and understanding.

Total Distance vs Calories burned

#Scatter plot with a linear trend line (geom_jitter() removed, since it drew every point a second time)
ggplot(data = daily_activity, aes(TotalDistance, Calories)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

Observations

The scatter plot above illustrates a positive correlation between the total distance covered and the calories burned by users. It suggests that as the distance covered increases, there is a corresponding increase in the number of calories burned.
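
A visual impression of correlation can be backed by a number. The sketch below shows how the Pearson correlation and the slope of the fitted line could be computed with base R's cor() and lm(). The toy values are hypothetical and chosen only to make the example runnable; they are not results from the Fitbit data.

```r
# Hypothetical toy values, not taken from the data set
toy <- data.frame(
  TotalDistance = c(2, 4, 6, 8, 10),
  Calories = c(1800, 2000, 2300, 2500, 2900)
)

# Pearson correlation between distance and calories
r <- cor(toy$TotalDistance, toy$Calories)

# Linear model matching geom_smooth(method = "lm"): Calories ~ TotalDistance
fit <- lm(Calories ~ TotalDistance, data = toy)
slope <- coef(fit)[["TotalDistance"]]  # extra calories per unit of distance

print(r)
print(slope)
```

On the real data, the same two calls applied to daily_activity would quantify the strength of the relationship shown in the plot.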

Activity minutes over a timeline

#Aggregating daily_activity table by Date 
int_new <- daily_activity %>%  
  group_by(Date) %>%
  drop_na() %>% 
  summarise(VeryActiveMinutes = sum(VeryActiveMinutes),
            FairlyActiveMinutes=sum(FairlyActiveMinutes),
            LightlyActiveMinutes=sum(LightlyActiveMinutes),
            SedentaryMinutes=sum(SedentaryMinutes))

ggplot(int_new, aes(x = Date)) +
  geom_point(aes(y = VeryActiveMinutes, color = "Very Active")) +
  geom_point(aes(y = FairlyActiveMinutes, color = "Fairly Active")) +
  geom_point(aes(y = LightlyActiveMinutes, color = "Lightly Active")) +
  geom_point(aes(y = SedentaryMinutes, color = "Sedentary")) +
  labs(x = "Date", y = "Minutes", title = "Activity Minutes Over Time") +
  scale_color_manual(name = "Activity Type", 
                     values = c("Very Active" = "blue", 
                                "Fairly Active" = "red", 
                                "Lightly Active" = "green", 
                                "Sedentary" = "orange")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "right")

Observations

The scatter plot reveals a notable trend: the predominant portion of users’ daily time is spent on sedentary activities. This suggests a strong inclination towards indoor, stationary activities over physically active outdoor endeavors among the users in the dataset.

Conversely, the time allocated to “Fairly Active” and “Very Active” minutes appears minimal in comparison to the substantial duration devoted to sedentary activities throughout the day.
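
One caveat for any chart with Date on the x-axis: the Date column in this report is read as text, and text sorts character by character rather than chronologically. The sketch below, on hypothetical date strings, shows the difference and how lubridate's mdy() could be used to get a true date before plotting. Whether the plot above is affected depends on how the axis is ordered, so this is offered as a check rather than a correction.

```r
library(lubridate)

# Dates stored as text, in the m/d/Y layout used by the exports
date_chr <- c("4/12/2016", "4/9/2016", "5/1/2016")

# Text order compares characters, so "4/12" sorts before "4/9"
text_order <- sort(date_chr)

# Parsed dates sort chronologically
date_parsed <- mdy(date_chr)
date_order <- sort(date_parsed)

print(text_order)
print(date_order)
```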

Heart Rate Categorization

heart_rate <- heart_rate %>%
  mutate(Category = case_when(
   Value < 60 ~ "Low Rate",
    between(Value, 60, 100) ~ "Normal Rate",
    Value > 100 ~ "High Rate"
  ))

#Count of Low rate values
Low_rate <- heart_rate %>%
           filter(Category == "Low Rate") %>% 
           nrow()

#Count of Normal rate values
Normal_rate <- heart_rate %>%   
           filter(Category == "Normal Rate") %>% 
           nrow()

#Count of High rate values  
High_rate <- heart_rate %>%    
           filter(Category == "High Rate") %>% 
           nrow()
#Total Count 
Total <- Low_rate + Normal_rate + High_rate

#% of each category out of total

Low_rate_perc <- (Low_rate/Total)*100 
Normal_rate_perc <- (Normal_rate/Total)*100 
High_rate_perc <- (High_rate/Total)*100 

labels <- c("Low rate", "Normal rate", "High rate")
lbls <- c(Low_rate_perc, Normal_rate_perc, High_rate_perc) #% values of heart rate in each category
lbls <- paste(round(lbls, 2), "%", sep="") # Added % sign after each category value
labels<- paste(labels, lbls, sep="-") # Printing both % and label name with "-" separator

#Pie Chart

pie(c(Low_rate_perc, Normal_rate_perc, High_rate_perc), 
    labels = labels, 
    col =c("navy blue", "orange", "maroon"),
    main = "Heart Rate Distribution")

Observations

The data reveals that approximately 74% of heart rate readings fall within the normal range (60-100 beats per minute), suggesting that most recorded activity stays within a healthy range.
Conversely, 11% of readings are high (above 100), which may imply more vigorous physical activity or heightened stress levels during those periods.
Additionally, 15% of readings are low (below 60), hinting at sedentary or low-intensity periods that contribute to the lower heart rates observed.
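
The category shares above are computed with three separate filter-and-count steps. A more compact way to get the same kind of table is dplyr's count() followed by a percentage column, sketched below on hypothetical toy readings. Because each row is one reading, the resulting percentages describe readings, not users.

```r
library(dplyr)

# Hypothetical heart rate readings in beats per minute
toy_hr <- tibble(Value = c(55, 72, 80, 95, 130))

hr_share <- toy_hr %>%
  mutate(Category = case_when(
    Value < 60   ~ "Low Rate",
    Value <= 100 ~ "Normal Rate",
    TRUE         ~ "High Rate"
  )) %>%
  count(Category) %>%                            # one row per category with its count n
  mutate(Percent = round(100 * n / sum(n), 2))   # share of all readings

print(hr_share)
```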

Sleep schedule chart

#summarising sleep table by ID
sleep_new <- sleep_day_new %>%  
  group_by(Id) %>%
  drop_na() %>% 
  summarise(MinutesAsleep = sum(TotalMinutesAsleep),
            MinutesInBed=sum(TotalTimeInBed))

#Extract last 3 digits of ID for cleaner axis labels
sleep_new <- sleep_new %>%
  mutate(ID = substr(Id, nchar(Id) - 2, nchar(Id)))

# Combine data for MinutesInBed and MinutesAsleep into one data frame
sleep_new_long <- tidyr::pivot_longer(sleep_new, cols = c(MinutesInBed, MinutesAsleep), names_to = "Variable", values_to = "Minutes")

#Creating bar chart
ggplot(sleep_new_long, aes(x = ID, y = Minutes, fill = Variable)) +
  geom_bar(position = position_dodge(width=0.8), stat = "identity") +
  labs(x = "ID", y = "Minutes", title = "Sleep Schedule by ID over a Timeline") +
  scale_fill_manual(values = c("MinutesInBed" = "orange", "MinutesAsleep" = "maroon")) +
  geom_hline(yintercept = mean(sleep_new_long$Minutes), color = "black", linetype = "dashed") +  # Add mean line
  theme_linedraw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Observations

The data suggests that a significant portion of individuals predominantly utilize their time in bed for sleeping, indicating a prioritization of sleep within their daily routines.
A noteworthy finding is that more than half of the users exceed the average recommended sleep duration or spend an extended period in bed. This observation underscores the prevalence of extended sleep patterns within the studied population.
However, it’s essential to acknowledge the limitations of drawing definitive conclusions due to limited data availability for several IDs. The absence of comprehensive sleep records for certain individuals impedes accurate assessments and comparisons of sleep schedules across the dataset.
It’s evident from the bar chart that some IDs exhibit notably low total sleep or bed minutes. This discrepancy arises from the lack of complete sleep records for these individuals, highlighting the challenge of making comprehensive comparisons across all IDs.

MinutesAsleep vs Calories Burned

#Calories and Minutes Asleep from merging two tables by ID
calories_new <- daily_activity[, c("Id", "Calories")] 
sleep_minutes <- sleep_day_new[, c("Id", "TotalMinutesAsleep")] 
sleep_calories <- merge(calories_new, sleep_minutes, by = c("Id"))

#Scatter Plot
ggplot(data = sleep_calories, aes(TotalMinutesAsleep, Calories) ) +
  geom_point(color = "navy blue") + geom_smooth(method = "lm", se = FALSE) + labs(x = "Minutes Asleep", y = "Total Calories burned", title = "Sleep Schedule vs Calories Burned")

## `geom_smooth()` using formula = 'y ~ x'

Observations

The horizontal line signifies a lack of direct correlation between minutes asleep and calories burned, possibly due to data limitations or the intricate nature of their association. Despite varying sleep durations, calorie expenditure remains consistently high, indicating the influence of other factors. Clustering around the line suggests additional influences on calorie expenditure beyond sleep duration.
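
One possible contributor to the flat trend is the join itself: merging on Id alone pairs every activity day of a user with every sleep record of that user, not just the record from the same date. The sketch below, on hypothetical toy tables, contrasts that with a join on both Id and Date. It is a suggestion for a follow-up check; the plot and observation above are based on the Id-only merge.

```r
# Hypothetical toy tables: calories per day and sleep per night
toy_calories <- data.frame(
  Id = c(1, 1, 2, 2),
  Date = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  Calories = c(1985, 1797, 2400, 2600)
)
toy_sleep <- data.frame(
  Id = c(1, 1, 2),
  Date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 384, 410)
)

# Join on Id only: every day of a user is paired with every night of that user
by_id <- merge(toy_calories[, c("Id", "Calories")],
               toy_sleep[, c("Id", "TotalMinutesAsleep")],
               by = "Id")

# Join on Id and Date: each day is paired with the sleep record of the same date
by_id_date <- merge(toy_calories, toy_sleep, by = c("Id", "Date"))

print(nrow(by_id))
print(nrow(by_id_date))
```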

Categorisation of BMI

weight_info_new <- weight_info_new %>%
  mutate(Category = case_when(
    between(BMI, 18.5, 24.9) ~ "Normal Range",
    between(BMI, 25, 29.9)  ~ "Overweight",
    BMI> 29.9 ~ "Obese"
  ))

#Count of Normal Range BMI
Normal_range <- weight_info_new %>%   
           filter(Category == "Normal Range") %>% 
           nrow()

#Count of Overweight users
Overweight <- weight_info_new %>%   
           filter(Category == "Overweight") %>% 
           nrow()

#Count of Obese users
Obese <- weight_info_new %>%    
           filter(Category == "Obese") %>% 
           nrow()
#Total Count 
Total <- Normal_range+Overweight+Obese

#% of each category out of total

Normal_range_perc <- (Normal_range/Total)*100 
Overweight_perc <- (Overweight/Total)*100 
Obese_perc <- (Obese/Total)*100 

BMI_labels <- c("Normal Range", "Overweight", "Obese")
BMI_lbls <- c(Normal_range_perc, Overweight_perc, Obese_perc) #% values of BMI in each category
BMI_lbls <- paste(round(BMI_lbls, 2), "%", sep="") # Added % sign after each category value
BMI_labels<- paste(BMI_labels, BMI_lbls, sep="-") # Printing both % and label name with "-" separator

#Bar Chart by each BMI category
barplot(c(Normal_range_perc, Overweight_perc, Obese_perc), 
        main = "BMI Categories",
        names.arg = BMI_labels,
        ylab = "Percentage",
        col = c("navy blue", "orange", "maroon"),
        ylim = c(0, 100),
        xpd = FALSE)

Observations

The data reveals that over half of the logged BMI values fall within the normal range (18.5-24.9), indicating potentially healthier lifestyles with increased physical activity and calorie expenditure.
Conversely, nearly half are classified as “Overweight,” highlighting a significant portion at risk of health complications related to excess weight.
A minimal 1% fall into the “Obese” category, emphasizing the need to address health concerns associated with obesity within this smaller subset.
The absence of fat data for most users limits a comprehensive assessment of weight status, since BMI alone does not distinguish fat from lean mass, and may obscure insights into individuals’ health and fitness levels.
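
The BMI bands in the code above are closed ranges, so a value between 24.9 and 25 (or below 18.5) matches none of them and receives NA. A minimal sketch of an alternative is base R's cut() with half-open intervals, which leaves no gaps. The "Underweight" label and the toy values are assumptions added for illustration, and switching to this scheme could change the percentages reported above.

```r
# Hypothetical BMI values, including one that falls between 24.9 and 25
bmi <- c(21.45, 24.95, 25.19, 47.54)

# right = FALSE gives intervals [18.5, 25), [25, 30), [30, Inf) with no gaps
category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal Range", "Overweight", "Obese"),
  right = FALSE
)

print(data.frame(bmi, category))
```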

Phase 6: Act

Conclusion

The analysis of smart device data provides valuable insights into user activity levels, sleep patterns, heart rate, and BMI distribution.
Bellabeat’s products, in particular health and fitness tracking, could address a wide range of consumer needs and preferences.
Product development and marketing strategies can be informed by the most important trends, such as a high prevalence of normal body mass index and a large share of sedentary time.
The depth of analysis is limited by incomplete information, in particular on body fat, gender, and age, so further data collection is needed.

Recommendations

Product Improvement: Develop features that encourage users to increase physical activity and improve sleep quality in line with the goal of promoting healthier lifestyles.
Targeted Marketing: Adapt marketing campaigns to the specific needs of users in different BMI categories, highlighting the benefits of Bellabeat products in supporting overall wellness.
Data collection: Prioritize the collection of comprehensive user data, including body fat percentage and demographic data, to improve the accuracy and depth of analysis of smart device usage.
Partnerships and collaborations: Explore partnerships with health and wellness organizations to leverage their healthy lifestyle expertise and expand Bellabeat’s market reach.
Customer Education: Provide educational resources and content to help users better understand metrics like BMI, heart rate, and activity levels, so that they can make informed decisions about health and fitness.

Google Course Capstone Project

Ridhima Bansal

2024-02-26
