Google Data Analytics Project: Bellabeat Analysis

Tarek ABOUNACEUR

November 28, 2021

Problem Statement

Analyze usage data for one of Bellabeat’s products to understand how consumers use their smart devices, and turn those insights into business solutions and marketing strategies that help the company increase sales and uncover growth opportunities.

Analysis Approach

Analyze smart device usage data to spot trends in how consumers use the product, then use those trends to inform Bellabeat’s marketing strategy and to develop business recommendations.

Stakeholders:

There are primary and secondary stakeholders for this project:

Primary:
- Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer.
- Sando Mur: Mathematician and Bellabeat’s co-founder; a key member of the Bellabeat executive team.
Secondary:
- Bellabeat marketing analytics team.

Data Source

The Fitbit Fitness Tracker Data is a publicly available dataset hosted on Kaggle. It was generated by respondents to a survey distributed via Amazon Mechanical Turk between March 12, 2016 and May 12, 2016 (listed as 03.12.2016 to 05.12.2016 in the source). It contains personal fitness tracker data from 30 eligible Fitbit users who consented to the submission of their tracker data. The dataset is made of 18 CSV files covering different aspects of the users’ data, including minute-level output for physical activity, heart rate, and sleep monitoring. Judging by the file names, the data is provided at second, minute, hourly, and daily granularity.

Data Credibility

For the data to be considered credible, it should meet the ROCCC criteria: Reliable, Original, Comprehensive, Current, and Cited.

Reliable: The data source is not reliable, since data was collected from only 30 users, which cannot represent the whole population. The results will be biased to a certain extent.
Original: The data was collected through an Amazon Mechanical Turk survey, so it is second- or third-party information rather than original data.
Comprehensive: Important information about the users, such as age and gender, is missing. This makes the data less comprehensive and the conclusions less accurate.
Current: The data is from 2016, so business recommendations based on it may no longer be fully relevant.
Cited: The data is not cited. Only the name of the survey appears, which makes it difficult to assume that the data is credible.

Data Storage

The data is stored in CSV files. Given the number and size of the files, it would be hard to process them in a spreadsheet application; SQL or R is better suited to this analysis.

Data Sorting and Filtering

The analysis is carried out in R. The first step is to load the data into the R environment and take a first look at it.

Loading the readr package and the csv files

library(readr)
daily_activity <- read_csv("C:/Users/herot/Desktop/Fitabase Data/dailyActivity_merged.csv")
daily_calories <- read_csv("C:/Users/herot/Desktop/Fitabase Data/dailyCalories_merged.csv")
daily_intensities <- read_csv("C:/Users/herot/Desktop/Fitabase Data/dailyIntensities_merged.csv")
daily_steps <- read_csv("C:/Users/herot/Desktop/Fitabase Data/dailySteps_merged.csv")
daily_sleep <- read_csv("C:/Users/herot/Desktop/Fitabase Data/sleepDay_merged.csv")
heartrate_seconds <- read_csv("C:/Users/herot/Desktop/Fitabase Data/heartrate_seconds_merged.csv")
hourly_calories <- read_csv("C:/Users/herot/Desktop/Fitabase Data/hourlyCalories_merged.csv")
hourly_intensities <- read_csv("C:/Users/herot/Desktop/Fitabase Data/hourlyIntensities_merged.csv")
hourly_steps <- read_csv("C:/Users/herot/Desktop/Fitabase Data/hourlySteps_merged.csv")
minute_calories_narrow <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteCaloriesNarrow_merged.csv")
minute_calories_wide <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteCaloriesWide_merged.csv")
minute_intensities_narrow <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteIntensitiesNarrow_merged.csv")
minute_intensities_wide <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteIntensitiesWide_merged.csv")
minute_mets_narrow <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteMETsNarrow_merged.csv")
minute_sleep <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteSleep_merged.csv")
minute_steps_narrow <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteStepsNarrow_merged.csv")
minute_steps_wide <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteStepsWide_merged.csv")
weight_log <- read_csv("C:/Users/herot/Desktop/Fitabase Data/weightLogInfo_merged.csv")

After a first look at the data, only 4 files seem relevant to the business task: daily_activity, which gathers the daily steps, distance, intensity, and calorie variables in a single file, plus heartrate_seconds, daily_sleep, and weight_log.

Removing the unnecessary data from the environment

rm(daily_intensities)
rm(daily_steps)
rm(hourly_calories)
rm(hourly_intensities)
rm(hourly_steps)
rm(minute_calories_narrow)
rm(minute_calories_wide)
rm(minute_intensities_narrow)
rm(minute_intensities_wide)
rm(minute_mets_narrow)
rm(minute_sleep)
rm(minute_steps_narrow)
rm(minute_steps_wide)
rm(daily_calories)
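
Since only four tables are kept, the load-then-remove steps above could be condensed by reading just the files of interest into a named list. The following is a minimal sketch in base R, not the code used for this report. It writes two tiny placeholder CSV files to a temporary folder so that it runs on its own, and the names data_dir, wanted, and tables are hypothetical.

```r
# Hypothetical stand-in for the "Fitabase Data" folder (assumption: a temp dir)
data_dir <- file.path(tempdir(), "fitabase_demo")
dir.create(data_dir, showWarnings = FALSE)

# Map the data frame names used in the report to their file names
wanted <- c(daily_activity = "dailyActivity_merged.csv",
            daily_sleep    = "sleepDay_merged.csv")

# Placeholder files so the sketch is self-contained (not real Fitbit data)
for (f in wanted) {
  write.csv(data.frame(Id = 1:2, Value = c(10, 20)),
            file.path(data_dir, f), row.names = FALSE)
}

# Read every wanted file in one pass and name the results
tables <- lapply(file.path(data_dir, wanted), read.csv)
names(tables) <- names(wanted)

str(tables, max.level = 1)
```

With this layout, tables$daily_sleep plays the role of daily_sleep, and no rm() calls are needed because unused files are never loaded.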

Data Cleaning

Loading necessary packages

library(tidyverse)
library(lubridate)
library(hrbrthemes)
library(corrplot)
library(ggcorrplot)
library(viridis)

Splitting date and time into separate columns

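## extra = "merge" keeps anything after the first space (e.g. an AM/PM suffix) in Time
weight_log <- weight_log %>% 
  separate(Date, c("Date", "Time"), sep = " ", extra = "merge")

heartrate_seconds <- heartrate_seconds %>% 
  separate(Time, c("Date", "Time"), sep = " ", extra = "merge")

daily_sleep <- daily_sleep %>% 
  separate(SleepDay, c("Date", "Time"), sep = " ", extra = "merge")
daily_sleep <- subset(daily_sleep, select=-Time) ## daily data: Time is not needed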

For the daily sleep data, the Time column is irrelevant because each row summarizes a whole day, so it is dropped from the table without affecting the analysis.

Calculating the average daily heart rate for each person

heartrate_daily <-
  tibble(heartrate_seconds %>%
           group_by(Date, Id) %>%
           summarise(Mean_Heartrate=(mean(Value))))

Dividing heart rate data into morning, afternoon, evening, and night

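## reload the raw file, because Time was split into Date and Time above
heartrate_time <- read_csv("C:/Users/herot/Desktop/Fitabase Data/heartrate_seconds_merged.csv")
## dmy_hms() reads timestamps as day/month/year; if the file stores month/day/year,
## mdy_hms() would be needed, otherwise days after the 12th fail to parse and become NA
heartrate_time$time <- dmy_hms(heartrate_time$Time)
heartrate_time <- na.omit(heartrate_time) ## remove missing values, including unparsed timestamps
breaks <- hour(hm("6:00", "12:00", "16:00", "19:00", "23:59")) ## hour boundaries: 6, 12, 16, 19, 23
labels <- c("Morning", "Afternoon", "Evening", "Night")
heartrate_time$Time_of_day <- cut(x=hour(heartrate_time$time), breaks = breaks, labels = labels, include.lowest = TRUE)
heartrate_time <- heartrate_time %>% drop_na() ## drop readings whose hour falls outside the breaks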
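
With the breaks above, cut() receives the hour values 6, 12, 16, 19, and 23, so readings taken between midnight and 5:59 fall outside every interval, become NA, and are dropped. The sketch below, in base R, first reproduces that behavior on a few sample hours and then shows one possible alternative that labels all 24 hours. The helper time_of_day() and the choice to count 0:00 to 5:59 as Night are assumptions, not part of the original analysis.

```r
hours <- c(0, 5, 6, 12, 16, 19, 23)

# Binning as in the report: hours below 6 match no interval and become NA
original_bins <- cut(hours,
                     breaks = c(6, 12, 16, 19, 23),
                     labels = c("Morning", "Afternoon", "Evening", "Night"),
                     include.lowest = TRUE)

# Hypothetical alternative: left-closed periods that cover the whole day
time_of_day <- function(h) {
  starts <- c(0, 6, 12, 16, 19)  # first hour of each period
  period <- c("Night", "Morning", "Afternoon", "Evening", "Night")
  period[findInterval(h, starts)]
}

data.frame(hour = hours,
           original = original_bins,
           alternative = time_of_day(hours))
```

The two schemes also differ at the boundaries: the report's intervals are closed on the right, so hour 12 counts as Morning, while the alternative assigns it to Afternoon. Switching schemes would change the averages reported below.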

Grouping heart rate data according to their time period

heartrate_grouped <-
  tibble(heartrate_time %>% 
           group_by(Time_of_day) %>% 
           summarise(heartrate_mean=(mean(Value))))
heartrate_grouped <- heartrate_grouped %>% drop_na()

Time_of_day	heartrate_mean
Morning	78.31348
Afternoon	81.17850
Evening	84.13020
Night	76.59507

Finding duplicates in each data frame

nrow(daily_activity[duplicated(daily_activity),])

## [1] 0

nrow(heartrate_daily[duplicated(heartrate_daily),])

## [1] 0

nrow(daily_sleep[duplicated(daily_sleep),])

## [1] 3

nrow(weight_log[duplicated(weight_log),])

## [1] 0

The sleep dataset has 3 duplicate rows. They should be removed to avoid skewed metrics and, therefore, wrong conclusions.

Removing duplicates from the daily_sleep data frame

daily_sleep <- dplyr::distinct(daily_sleep)
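
As a small illustration of the check-then-remove pattern used here, the following base R sketch runs on a made-up table (sleep_demo and its values are hypothetical). duplicated() flags every repeat of an earlier row, and unique() plays the role that dplyr::distinct() plays in the report.

```r
# Made-up sleep records: rows 1 and 2 are exact duplicates
sleep_demo <- data.frame(
  Id = c(1, 1, 2, 2),
  Date = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 327, 412, 384)
)

n_duplicates <- sum(duplicated(sleep_demo))  # repeats of an earlier row
sleep_clean  <- unique(sleep_demo)           # keep the first copy of each row

n_duplicates
nrow(sleep_clean)
```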

Finding null values in each data frame

which(is.na(daily_activity))

## integer(0)

which(is.na(heartrate_daily))

## integer(0)

which(is.na(daily_sleep))

## integer(0)

which(is.na(weight_log))

##  [1] 337 338 339 340 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356
## [20] 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375
## [39] 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394
## [58] 395 396 397 398 399 400 401 402

Null values are present only in the weight dataset. The next step finds which column holds them and then removes that column.

colnames(weight_log)[colSums(is.na(weight_log)) > 0]

## [1] "Fat"

weight_log <- select(weight_log, -Fat)
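
The long list of positions printed by which(is.na(weight_log)) can look puzzling. The positions are linear indices counted column by column, which is why they form one nearly contiguous run when a single column holds all the missing values. A minimal base R sketch on a made-up table (weight_demo is hypothetical) shows the idea and the per-column count used to find the column:

```r
# Made-up weight log: only Fat has missing values
weight_demo <- data.frame(
  Id       = 1:4,
  WeightKg = c(52.6, 61.4, 62.5, 85.1),
  Fat      = c(22, NA, NA, 25)
)

# Linear positions, counted down column 1, then column 2, and so on
na_positions <- which(is.na(weight_demo))

# Missing values per column, and the columns that have any
na_per_column <- colSums(is.na(weight_demo))
cols_with_na  <- names(na_per_column)[na_per_column > 0]

na_positions
cols_with_na
```

Dropping the whole column also discards its non-missing entries, which is acceptable here because Fat is not used later in the analysis.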

Creating a data frame with common users from all the data frames

unique_dataframe <- merge(daily_activity, daily_sleep, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
unique_dataframe <- merge(unique_dataframe, select(weight_log, -Time), by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
unique_dataframe <- merge(unique_dataframe, heartrate_daily, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
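
By default, merge() performs an inner join, so each call above keeps only the Id and date pairs found in both inputs. That is why unique_dataframe ends up restricted to users present in every table. The base R sketch below illustrates this on made-up data (activity_demo and sleep_demo are hypothetical) and contrasts it with a left join via all.x = TRUE:

```r
activity_demo <- data.frame(Id = c(1, 2, 3),
                            ActivityDate = "4/12/2016",
                            TotalSteps = c(8000, 12000, 3000))
sleep_demo <- data.frame(Id = c(1, 3, 4),
                         Date = "4/12/2016",
                         TotalMinutesAsleep = c(420, 390, 450))

# Inner join (default): only Ids 1 and 3 appear in both tables
inner <- merge(activity_demo, sleep_demo,
               by.x = c("Id", "ActivityDate"), by.y = c("Id", "Date"))

# Left join: every activity row is kept, with NA where sleep data is missing
left <- merge(activity_demo, sleep_demo,
              by.x = c("Id", "ActivityDate"), by.y = c("Id", "Date"),
              all.x = TRUE)

inner
left
```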

Data Analysis

Finding the unique number of persons in each data frame

length(unique(daily_activity$Id))

## [1] 33

length(unique(heartrate_daily$Id))

## [1] 14

length(unique(daily_sleep$Id))

## [1] 24

length(unique(weight_log$Id))

## [1] 8

The largest data frame, daily_activity, covers 33 participants (more than the 30 stated in the dataset description), while only 3 of them appear in all four data frames.
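
The number of users common to all data frames can be computed by intersecting the Id vectors. The sketch below uses made-up Ids in base R; the list ids and its contents are hypothetical stand-ins for the Id columns of the four data frames.

```r
# Made-up Id vectors standing in for daily_activity$Id, heartrate_daily$Id, etc.
ids <- list(
  activity  = c(1, 2, 3, 4, 5),
  heartrate = c(2, 3, 5),
  sleep     = c(1, 2, 3, 5),
  weight    = c(2, 5, 6)
)

# Distinct users per table
users_per_table <- sapply(ids, function(x) length(unique(x)))

# Users present in every table: fold intersect() over the list
common_ids <- Reduce(intersect, ids)

users_per_table
common_ids
length(common_ids)
```

With the real data, the same call would take the form Reduce(intersect, list(daily_activity$Id, heartrate_daily$Id, daily_sleep$Id, weight_log$Id)).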

Data frames summary

Daily activity summary

daily_activity %>%
  select(TotalSteps,
         TotalDistance,
         TrackerDistance,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes,
         Calories) %>% 
  summary()

##    TotalSteps    TotalDistance    TrackerDistance  VeryActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   : 0.000   Min.   :  0.00   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 2.620   1st Qu.:  0.00   
##  Median : 7406   Median : 5.245   Median : 5.245   Median :  4.00   
##  Mean   : 7638   Mean   : 5.490   Mean   : 5.475   Mean   : 21.16   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 7.710   3rd Qu.: 32.00   
##  Max.   :36019   Max.   :28.030   Max.   :28.030   Max.   :210.00   
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0.0   Min.   :   0  
##  1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8   1st Qu.:1828  
##  Median :  6.00      Median :199.0        Median :1057.5   Median :2134  
##  Mean   : 13.56      Mean   :192.8        Mean   : 991.2   Mean   :2304  
##  3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :143.00      Max.   :518.0        Max.   :1440.0   Max.   :4900

Observations:

The average number of daily steps per person is 7638.
The average daily total distance (5.490) and the average daily tracked distance (5.475) are nearly identical.
The average number of calories burnt per person per day is 2304.
The sample size is 33 people.

Daily sleep summary

daily_sleep %>% 
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>% 
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

Observations:

The average sleeping time is 419.2 minutes, which is about 7 hours.
The average time spent in bed is 458.5 minutes, which is about 7 hours and 38 minutes.
On average, each person spends about 39 minutes in bed without sleeping (458.5 minus 419.2 minutes); this includes, but is not limited to, the time needed to fall asleep.
The sample size is 24 people.
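
The conversions behind these observations can be checked with a few lines of base R, using the two means from the sleep summary above. The helper to_hours_minutes() is a hypothetical convenience function that truncates to whole minutes.

```r
# Means taken from the daily sleep summary
mean_asleep <- 419.2
mean_in_bed <- 458.5

# Hypothetical helper: format a duration in minutes as hours and minutes
to_hours_minutes <- function(minutes) {
  total <- as.integer(floor(minutes))
  sprintf("%d h %02d min", total %/% 60L, total %% 60L)
}

to_hours_minutes(mean_asleep)   # "6 h 59 min"
to_hours_minutes(mean_in_bed)   # "7 h 38 min"

# Average time in bed without sleeping, in minutes
awake_in_bed <- mean_in_bed - mean_asleep
round(awake_in_bed, 1)          # 39.3
```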

Average heart rate summary

summary(heartrate_daily$Mean_Heartrate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   59.38   70.47   77.49   78.61   84.93  109.79

Observations:

The average daily heart rate is 78.61 beats per minute.
The sample size is 14 people.

Heart rate based on time of the day summary

heartrate_time %>% 
  select(Value,
         Time_of_day) %>% 
  summary()

##      Value          Time_of_day    
##  Min.   : 38.0   Morning  :329894  
##  1st Qu.: 66.0   Afternoon:208150  
##  Median : 77.0   Evening  :150108  
##  Mean   : 79.8   Night    :139066  
##  3rd Qu.: 90.0                     
##  Max.   :199.0

heartrate_grouped

## # A tibble: 4 x 2
##   Time_of_day heartrate_mean
##   <fct>                <dbl>
## 1 Morning               78.3
## 2 Afternoon             81.2
## 3 Evening               84.1
## 4 Night                 76.6

Observations:

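Morning has the largest number of heart rate readings, with 329894 entries.
The average heart rate recorded at night is 76.6, the lowest of the four periods. A likely reason is that people are winding down or asleep, although the Night bin only covers the late evening (hours 20 to 23), because readings between midnight and 5:59 fall outside the breaks and are excluded.
The average heart rate recorded during the evening is 84.1, the highest of the four periods.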

Weight summary

weight_log %>% 
  select(WeightKg,
         BMI) %>% 
  summary()

##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

Observations:

The average logged weight is 72.04 kg.
The average body mass index is 25.19, which sits at the low end of the overweight range (25 to 29.9). The median, 24.39, is below that range.
The sample size is 8 people.
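
To see where each summary statistic falls, the BMI values printed above can be mapped to categories. This base R sketch takes the overweight range (25 to 29.9) from the observations; the 18.5 and 30 cut-offs for underweight and obese are the commonly used adult thresholds and are an assumption here, as are the names bmi_values and bmi_class.

```r
# Values copied from the weight summary output
bmi_values <- c(Min = 21.45, Q1 = 23.96, Median = 24.39,
                Mean = 25.19, Q3 = 25.56, Max = 47.54)

# Left-closed intervals: [18.5, 25) is Normal, [25, 30) is Overweight
bmi_class <- cut(bmi_values,
                 breaks = c(-Inf, 18.5, 25, 30, Inf),
                 labels = c("Underweight", "Normal", "Overweight", "Obese"),
                 right = FALSE)

data.frame(statistic = names(bmi_values),
           bmi = unname(bmi_values),
           class = bmi_class)
```

The minimum, first quartile, and median all land in the Normal class, while the maximum of 47.54 is far above the rest and pulls the mean upward.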

Data Visualization

Checking for a correlation between the total Steps and the calories burnt

viz1 <- ggplot(data=daily_activity, aes(x=TotalSteps, y=Calories))+
  geom_point(colour="yellow")+
  geom_smooth(method=lm, colour="white")+
  labs(title = "Total Steps VS Calories", x="Total Steps", y="Calories")+
  scale_x_comma()+
  theme_ft_rc()
plot(viz1)

## `geom_smooth()` using formula 'y ~ x'

The plot shows a positive correlation between total steps and calories burnt. However, a few outliers do not follow this trend.
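
The visual impression can be backed by numbers. The following base R sketch runs on made-up values (activity_demo is hypothetical, not the Fitbit data) and shows how the Pearson coefficient, the fitted slope, and candidate outliers could be obtained. Flagging standardized residuals above 2 in absolute value is a common rule of thumb and an assumption here, not a step taken in the original analysis.

```r
# Made-up daily records for illustration only
activity_demo <- data.frame(
  TotalSteps = c(1000, 4000, 7000, 10000, 13000),
  Calories   = c(1700, 2000, 2150, 2500, 2700)
)

# Strength of the linear relationship
r <- cor(activity_demo$TotalSteps, activity_demo$Calories)

# Same linear model that geom_smooth(method = lm) draws
fit <- lm(Calories ~ TotalSteps, data = activity_demo)
calories_per_step <- unname(coef(fit)["TotalSteps"])

# Rows whose standardized residual is unusually large
outlier_rows <- which(abs(rstandard(fit)) > 2)

r
calories_per_step
outlier_rows
```

Applied to daily_activity, the same calls would replace the subjective reading of the scatter plot with a coefficient and an explicit list of rows worth inspecting.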

Comparing the distance tracked by the smartwatch with the total distance traveled by the consumer

viz2 <- ggplot(data=daily_activity, aes(x=TrackerDistance, y=TotalDistance))+
  geom_point(colour="yellow")+
  geom_smooth(method=lm, colour="white")+
  labs(title = "Total Distance VS Tracker Distance", x="Tracker Distance", y="Total Distance")+
  scale_x_comma()+
  theme_ft_rc()
plot(viz2)

## `geom_smooth()` using formula 'y ~ x'

The plot shows that the tracked distance and the total distance are almost identical. This suggests that the tracker (a Fitbit device, per the data source) captures nearly all of the distance covered by the users. In some cases, the total distance is greater than the tracked distance. One possible explanation is human error, i.e. the users may have forgotten to wear the smartwatch for a certain amount of time.

Checking for a correlation between the active minutes and the calories burnt

viz3 <- ggplot(data=daily_activity, aes(x=VeryActiveMinutes, y=Calories))+
  geom_point(colour="yellow")+
  geom_smooth(colour="white")+
  labs(title="Very Active Minutes VS Calories", x= "Very Active Minutes")+
  theme_ft_rc()

viz4 <- ggplot(data=daily_activity, aes(x=FairlyActiveMinutes, y=Calories))+
  geom_point(colour="yellow")+
  geom_smooth(colour="white")+
  labs(title="Fairly Active Minutes VS Calories", x= "Fairly Active Minutes")+
  theme_ft_rc()

viz5 <- ggplot(data=daily_activity, aes(x=LightlyActiveMinutes, y=Calories))+
  geom_point(colour="yellow")+
  geom_smooth(colour="white")+
  labs(title="Lightly Active Minutes VS Calories", x= "Lightly Active Minutes")+
  theme_ft_rc()

plot(viz3)

plot(viz4)

plot(viz5)

The three plots suggest that very active minutes and lightly active minutes are positively correlated with calories burnt, whereas fairly active minutes appear to be negatively correlated with them. Most observations for very active and fairly active minutes are concentrated near 0 minutes, while lightly active minutes are spread over a much wider range.

Comparing the total minutes asleep and the total time in bed

viz6 <- ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed))+
  geom_point(colour="yellow")+
  geom_smooth(method=lm, colour="white")+
  labs(title="Total Minutes in Bed VS Total Minutes Asleep", x="Total Minutes Asleep", y="Total Minutes in Bed")+
  theme_ft_rc()
plot(viz6)

## `geom_smooth()` using formula 'y ~ x'

The plot shows that total minutes asleep and total time in bed track each other closely, except for some outliers.

Checking for a correlation between the active minutes and the total minutes asleep

viz7 <- ggplot(data=unique_dataframe, aes(x=VeryActiveMinutes, y=TotalMinutesAsleep))+
  geom_point(colour="yellow")+
  geom_smooth(colour="white")+
  labs(title="Total Minutes Asleep VS Very Active Minutes", x= "Very Active Minutes", y="Total Minutes Asleep")+
  theme_ft_rc()

viz8 <- ggplot(data=unique_dataframe, aes(x=FairlyActiveMinutes, y=TotalMinutesAsleep))+
  geom_point(colour="yellow")+
  geom_smooth(colour="white")+
  labs(title="Total Minutes Asleep VS Fairly Active Minutes", x= "Fairly Active Minutes", y="Total Minutes Asleep")+
  theme_ft_rc()

viz9 <- ggplot(data=unique_dataframe, aes(x=LightlyActiveMinutes, y=TotalMinutesAsleep))+
  geom_point(colour="yellow")+
  geom_smooth(colour="white")+
  labs(title="Total Minutes Asleep VS Lightly Active Minutes", x= "Lightly Active Minutes", y="Total Minutes Asleep")+
  theme_ft_rc()

plot(viz7)

plot(viz8)

plot(viz9)

The three plots suggest that very active minutes and fairly active minutes are positively correlated with total minutes asleep, whereas lightly active minutes are negatively correlated with it. Because unique_dataframe only contains users present in all four data frames, these trends rest on a very small sample.

Correlation Matrix

variables_matrix <- select(daily_activity, TotalSteps, TotalDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, Calories)
correlation_matrix <- cor(variables_matrix)
ggcorrplot(correlation_matrix, lab = TRUE)+
  scale_fill_gradient2(low = "red", high = "darkslateblue", mid = "white")

The correlation matrix shows that total steps, total distance, and very active minutes, with correlation coefficients of 0.59, 0.64, and 0.62 respectively, are moderately to strongly correlated with calories burnt.

Conclusions for the data analysis

The difference between the total time in bed and the total time asleep is low, about 39 minutes on average. The average total time asleep is about 7 hours, which suggests that most participants get sufficient sleep, which is good for their health.
The average body mass index recorded is 25.19, which falls just inside the overweight range of 25 to 29.9. However, the median is 24.39 and only 8 people logged their weight, so this does not show that most participants are overweight. It may indicate that some of them are not following a good diet or not exercising enough.
The tracked distance and the total distance are almost identical, which suggests that the tracker records distance reliably.
The sample size is considerably low, with only 3 people present in all the datasets, so the conclusions drawn from this sample might not be accurate.

Business insights for the Bellabeat Marketing Analytics Team

A mobile push notification reminding users to be active could be a good idea, since there is a correlation between the total steps and the calories burnt.
Adding daily step goals and achievements to the mobile application would incentivize users to be active.
The maximum heart rate recorded was 199, which is very high and alarming. An alert system that triggers when the heart rate exceeds a certain threshold would be a good feature to implement, as it could save lives.
The average total time asleep of about 7 hours is good. To help users keep the same sleep routine, a bedtime reminder could be added to the application.
Running further surveys to gather more data, such as age and gender, from a bigger sample might be important for making more targeted improvements. To encourage users to complete all the surveys, rewards should be given to those who complete all of them. The rewards would be funded from a budget specifically allocated to this purpose.

LinkedIn: Tarek Abounaceur