This analysis project was completed as part of the Google Data Analytics professional course offered on Coursera.
You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.
Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that
manufactures health-focused smart products. Sršen used her background as
an artist to develop beautifully designed technology that informs and
inspires women around the world. Collecting data on activity, sleep,
stress, and reproductive health has allowed Bellabeat to empower women
with knowledge about their own health and habits. Since it was founded
in 2013, Bellabeat has grown rapidly and quickly positioned itself as a
tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched
multiple products. Bellabeat products became available through a growing
number of online retailers in addition to their own e-commerce channel
on their website. The company has invested in traditional advertising
media, such as radio, out-of-home billboards, print, and television, but
focuses on digital marketing extensively. Bellabeat invests year-round
in Google Search, maintaining active Facebook and Instagram pages, and
consistently engages consumers on Twitter. Additionally, Bellabeat runs
video ads on YouTube and display ads on the Google Display Network to
support campaigns around key marketing dates.
Sršen knows that an analysis of Bellabeat’s available consumer data
would reveal more opportunities for growth. She has asked the marketing
analytics team to focus on a Bellabeat product and analyze smart device
usage data in order to gain insight into how people are already using
their smart devices. Then, using this information, she would like
high-level recommendations for how these trends can inform Bellabeat
marketing strategy.
To analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices, and to apply those insights to one of Bellabeat’s products.
To inspect the datasets and assess their integrity before using them
for our analysis.
The data for this project come from the FitBit Fitness Tracker Data set,
made available through Mobius.
This dataset was generated by respondents to a survey distributed via Amazon
Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit
users consented to the submission of personal tracker data, including
minute-level output for physical activity, heart rate, and sleep
monitoring. Individual reports can be parsed by export session ID
(column A) or timestamp (column B). Variation between output represents
use of different types of Fitbit trackers and individual tracking
behaviors / preferences.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(skimr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2)
library(rmarkdown)
daily_activity <- read.csv("dailyActivity_merged.csv")
calories <- read.csv("dailyCalories_merged.csv")
intensity <- read.csv("dailyIntensities_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")
sleep_rate <- read.csv("sleepDay_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")
head(daily_activity)
head(calories)
head(intensity)
head(daily_steps)
head(sleep_rate)
head(weight_log)
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(calories$Id)
## [1] 33
n_distinct(intensity$Id)
## [1] 33
n_distinct(daily_steps$Id)
## [1] 33
n_distinct(sleep_rate$Id)
## [1] 24
n_distinct(weight_log$Id)
## [1] 8
24 out of 33 users have provided their sleep data and 8 out of 33 users have provided their weight and BMI data.
Checking the data integrity and credibility using the ROCCC
approach.
1. Reliable - The data set is not reliable: its
sample size is very small (33 users) and it is not representative of
the whole population.
2. Original - The data set was collected through a survey
via Amazon Mechanical Turk, so it is second- or third-party data
and may not be original.
3. Comprehensive - The information is inadequate. Some of the
most crucial information for answering the question, such as age, gender,
and location, is unavailable.
4. Current - The data were collected in 2016 and are therefore
outdated.
5. Cited - No cited source is mentioned, so it is
difficult to confirm the data's credibility.
The daily_activity table already contains the data on calories, intensity, and daily steps. To verify that the data are the same in both sets of tables, we can check whether the number of rows and the number of Ids match.
nrow(daily_activity)
## [1] 940
nrow(calories)
## [1] 940
nrow(intensity)
## [1] 940
nrow(daily_steps)
## [1] 940
All four tables have 940 rows and 33 distinct Ids, so the row and user counts match.
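Matching row counts are a necessary check but not a sufficient one. A stricter option is to join a narrow table to the activity table on user and day and compare the values directly. The following is a minimal sketch on toy data, not the check used in this analysis; the column names `ActivityDay` and `StepTotal` are assumptions about the steps file and should be confirmed with `colnames(daily_steps)`.

```r
# Toy stand-ins for daily_activity and daily_steps (values are illustrative only).
activity_demo <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 8000)
)
steps_demo <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),  # assumed column name
  StepTotal = c(13162, 10735, 8000)                         # assumed column name
)

# Join on user and day, then compare the two step columns row by row.
joined <- merge(
  activity_demo, steps_demo,
  by.x = c("Id", "ActivityDate"),
  by.y = c("Id", "ActivityDay")
)

same_rows <- nrow(joined) == nrow(activity_demo)
same_values <- all(joined$TotalSteps == joined$StepTotal)
print(c(same_rows = same_rows, same_values = same_values))
```

If both flags are `TRUE`, the narrow table adds nothing beyond what the activity table already holds.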
Cleaning the dataset of any duplicated values.
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(sleep_rate))
## [1] 3
sum(duplicated(weight_log))
## [1] 0
sleep_rate has 3 duplicated rows, so the duplicates are removed.
sleep_rate <-
sleep_rate %>%
distinct()
sum(duplicated(sleep_rate))
## [1] 0
Cleaning the data set of any null values.
sum(is.na(daily_activity))
## [1] 0
sum(is.na(sleep_rate))
## [1] 0
sum(is.na(weight_log))
## [1] 65
str(weight_log)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
Fat is missing for 65 of the 67 weight records, and since this analysis does not use it, the column is dropped.
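A quicker way to see which column holds the missing values is to count them per column rather than for the whole table. This is a small sketch on a toy data frame; `weight_demo` and its values are hypothetical.

```r
# Toy stand-in for weight_log (values are illustrative only).
weight_demo <- data.frame(
  Id = c(1, 1, 2, 3),
  WeightKg = c(52.6, 52.6, 133.5, 56.7),
  Fat = c(22L, NA, NA, NA),
  BMI = c(22.6, 22.6, 47.5, 21.5)
)

# Count missing values per column instead of for the whole table.
na_per_column <- colSums(is.na(weight_demo))
print(na_per_column)
```

Applied to the real table, `colSums(is.na(weight_log))` would show directly whether all missing values sit in a single column.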
weight_log <- weight_log %>%
select(-Fat)
sum(is.na(weight_log))
## [1] 0
Separating date and time values from weight and sleep data sets.
weight_log<-
weight_log %>%
separate(Date, c("Date", "Time")," ") %>%
select(-IsManualReport, -WeightPounds, -LogId, -Time)
## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(weight_log)
sleep_rate<-
sleep_rate %>%
separate("SleepDay", c("Date", "Time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 410 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(sleep_rate)
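The "Additional pieces discarded" warnings appear because each timestamp has three space-separated parts (date, time, and AM/PM), so splitting on a space drops the AM/PM marker. An alternative, sketched here on two example strings in the same format, is to parse the whole timestamp with lubridate and derive the date and time from it. The variable names are illustrative.

```r
library(lubridate)

# Example timestamps in the same month/day/year, 12-hour format as the source files.
raw_dates <- c("5/2/2016 11:59:59 PM", "4/13/2016 1:08:52 AM")

# mdy_hms() understands the AM/PM marker, so nothing is discarded.
parsed <- mdy_hms(raw_dates)

# Split into a Date column and a 24-hour time string if both are needed.
date_only <- as_date(parsed)
time_only <- format(parsed, "%H:%M:%S")

print(data.frame(raw_dates, date_only, time_only))
```

Keeping a real `Date` column also makes later steps such as `weekdays()` work without a format string.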
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
tracker_usage <- daily_activity %>%
group_by(Id) %>%
summarize(active_days = n_distinct(ActivityDate))
tracker_usage <- tracker_usage %>%
mutate (usage = case_when(
active_days <= 10 ~ 'Low',
active_days > 10 & active_days <= 20 ~ 'Moderate',
active_days > 20 ~ 'High')) %>%
group_by(usage)
head(tracker_usage)
# Count users per usage level so each slice is sized by the number of users.
ggplot(count(tracker_usage, usage), aes(x=" ", y=n, fill=usage))+
geom_col()+
coord_polar("y")+
labs(x=NULL, y=NULL, title = "Tracker Usage", caption ="33 User Data")+
theme(axis.ticks=element_blank(),
axis.text.x=element_blank(),
legend.position="top",
legend.title = element_blank())
The pie chart shows how regularly users wore the tracker: nearly 3 out of 4 users have high usage (more than 20 active days).
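The share behind a statement like this can be read off a small summary table instead of the chart. The sketch below uses hypothetical active-day counts, not the real data, and applies the same three usage bands.

```r
library(dplyr)

# Hypothetical per-user counts of active days (illustrative values only).
usage_demo <- data.frame(
  Id = 1:8,
  active_days = c(31, 31, 30, 28, 25, 18, 12, 4)
)

# Assign the usage band, then count users and compute each band's share.
usage_share <- usage_demo %>%
  mutate(usage = case_when(
    active_days <= 10 ~ "Low",
    active_days <= 20 ~ "Moderate",
    TRUE ~ "High"
  )) %>%
  count(usage) %>%
  mutate(share = n / sum(n))

print(usage_share)
```

The `share` column gives the proportion of users in each band, which is the number a pie slice should represent.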
sleep_day <- sleep_rate %>%
mutate(day = weekdays(as.Date(Date, format = "%m/%d/%Y")))
sleep_day <- sleep_day %>%
group_by(day) %>%
summarize(days_used = n()) %>%
arrange(day)
sleep_day$day <-ordered(sleep_day$day, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
sleep_day <- sleep_day %>%
arrange(day)
ggplot(data = sleep_day) +
geom_col(aes(x=day, y=days_used), fill='#FFDEB4') +
geom_text(aes(x=day, y=days_used, label = days_used), vjust = -0.5) +
labs(
x="Days of the week",
y="Usage",
title = "TRACKER USAGE OVER THE WEEK"
)
Sleep tracking is most frequent mid-week and comparatively less frequent on weekends and Mondays.
tracker_variance <-
daily_activity %>%
filter(TotalDistance!=TrackerDistance)
Red points mark the records where TotalDistance and TrackerDistance differ.
ggplot(daily_activity)+
geom_point(mapping = aes(x=TotalDistance, y=TrackerDistance))+
geom_point(data=tracker_variance, mapping=aes(x=TotalDistance, y=TrackerDistance), color='red')+
labs(
x="Total Distance",
y="Tracker Measurement",
title="Tracker Measurement Variance"
)
A few records do not match: in those, the tracker-measured distance is less
than the total distance recorded.
The variation is minor and should therefore not affect the results.
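To back up "minor" with numbers, the mismatch can be summarised as a count, a share of rows, and a largest gap. This is a sketch on toy distances; the values and the name `mismatch_summary` are hypothetical.

```r
library(dplyr)

# Toy stand-in for daily_activity (distances are illustrative only).
distance_demo <- data.frame(
  TotalDistance   = c(8.50, 6.97, 9.88, 5.25),
  TrackerDistance = c(8.50, 6.97, 9.00, 5.25)
)

# Summarise how often and by how much the two measurements disagree.
mismatch_summary <- distance_demo %>%
  mutate(gap = TotalDistance - TrackerDistance) %>%
  summarise(
    mismatched_rows = sum(gap != 0),
    share_mismatched = mean(gap != 0),
    largest_gap = max(abs(gap))
  )

print(mismatch_summary)
```

Running the same summary on `daily_activity` would show what fraction of the 940 rows is affected.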
ggplot(daily_activity, aes(x=TotalDistance, y=Calories))+
geom_point(color="cyan")+
geom_smooth()+
labs(title="Distance vs Calories Burned", subtitle = "940 observations of 33 users")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
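The scatter plot can be complemented with a numeric summary of the relationship. The sketch below computes a Pearson correlation and a linear slope on toy data; the values are hypothetical and only illustrate the calls.

```r
# Toy stand-in for daily_activity (values are illustrative only).
activity_demo <- data.frame(
  TotalDistance = c(0.0, 2.5, 5.1, 6.9, 8.5, 12.2),
  Calories = c(1400, 1650, 1800, 1950, 2100, 2900)
)

# Pearson correlation between distance and calories.
distance_calories_cor <- cor(activity_demo$TotalDistance, activity_demo$Calories)

# Slope of a simple linear fit: extra calories per additional unit of distance.
fit <- lm(Calories ~ TotalDistance, data = activity_demo)
calories_per_unit <- unname(coef(fit)["TotalDistance"])

print(c(correlation = distance_calories_cor, slope = calories_per_unit))
```

Because each user contributes many rows, a correlation over all 940 observations mixes within-user and between-user effects and should be read with that in mind.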
Next, we analyze how the distance covered relates to the calories burned.
distance_report <-
daily_activity %>%
group_by(Id) %>%
summarize(total_distance=sum(TotalDistance), total_steps=sum(TotalSteps), calories=sum(Calories),
very_active=sum(VeryActiveDistance), moderately_active=sum(ModeratelyActiveDistance),
light_active=sum(LightActiveDistance))
distance_report <-
distance_report %>%
# case_when() checks conditions in order, so each band starts where the previous one ends.
mutate(activity=case_when(
calories>=80000 ~ "intense",
calories>=50000 ~ "moderate",
TRUE ~ "light")) %>%
mutate(distance_range=case_when(
total_distance>=350 ~ ">350km",
total_distance>=250 ~ "250-350km",
total_distance>=150 ~ "150-250km",
TRUE ~ "0-150km"))
head(distance_report)
ggplot(data=distance_report, aes(x=activity, fill=distance_range))+
geom_bar()+
labs(x="Intensity", title = "Intensity of calories burnt", subtitle = "940 observations of 33 users")+
theme(axis.text.y= element_blank(), axis.title.y = element_blank(), axis.title.x = element_text(face = "bold"), legend.title = element_blank())
A user’s activity is categorized as sedentary, lightly active, fairly active, or very active based on their mean daily step count.
daily_average <- daily_activity %>%
group_by(Id) %>%
summarise (mean_daily_steps = mean(TotalSteps), mean_daily_calories = mean(Calories))
user_type <- daily_average %>%
group_by(Id) %>%
# Means are fractional, so the bands must be contiguous (e.g. 7499.5 is lightly active).
mutate(user_activity = case_when(
mean_daily_steps < 5000 ~ "sedentary",
mean_daily_steps < 7500 ~ "lightly active",
mean_daily_steps < 10000 ~ "fairly active",
TRUE ~ "very active"
)) %>%
arrange(user_activity)
head(user_type)
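Because mean daily steps are fractional, the step bands need to be contiguous so that no value falls between two categories. One way to guarantee this is base R's `cut()`, shown here as a sketch with hypothetical step values; it is an alternative to `case_when()`, not the method used above.

```r
# Mean daily steps are fractional, so include a value between 7499 and 7500.
mean_steps <- c(3200.4, 7499.5, 9999.2, 12500)

# right = FALSE makes each interval closed on the left: [5000, 7500), and so on.
user_activity <- cut(
  mean_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "very active"),
  right = FALSE
)

print(data.frame(mean_steps, user_activity))
```

`cut()` also returns an ordered set of factor levels, which keeps the categories in a sensible order on plot axes.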
user_type$Id <- as.character(user_type$Id)
ggplot(user_type, aes(x=user_activity, y=Id, fill=mean_daily_steps))+
geom_bar(stat="identity", position = "stack")+
coord_polar("x")+
labs(x=NULL, y=NULL, title="ACTIVITY AND DAILY STEPS")+
theme(axis.title = NULL, axis.text.x = element_text(colour='black', angle=22, face = "bold"))
ggplot(sleep_rate, aes(x=TotalMinutesAsleep, y=TotalTimeInBed, color=Id)) +
geom_jitter()+
geom_abline()+
labs(
x="Sleep Time",
y="Time in Bed",
title = "TIME IN BED vs TIME ASLEEP"
)
A few users spend noticeably more time in bed than asleep, which suggests they have trouble falling asleep after going to bed.
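The gap between time in bed and time asleep can also be expressed per user as a ratio. The sketch below computes what is assumed here to be a reasonable "sleep efficiency" measure (minutes asleep divided by minutes in bed) on toy data; the metric name and the values are illustrative.

```r
library(dplyr)

# Toy stand-in for sleep_rate (values are illustrative only).
sleep_demo <- data.frame(
  Id = c("A", "A", "B", "B"),
  TotalMinutesAsleep = c(420, 450, 300, 330),
  TotalTimeInBed = c(440, 470, 420, 450)
)

# Share of time in bed actually spent asleep, averaged per user.
sleep_efficiency <- sleep_demo %>%
  mutate(efficiency = TotalMinutesAsleep / TotalTimeInBed) %>%
  group_by(Id) %>%
  summarise(
    mean_efficiency = mean(efficiency),
    mean_minutes_awake_in_bed = mean(TotalTimeInBed - TotalMinutesAsleep)
  ) %>%
  arrange(mean_efficiency)

print(sleep_efficiency)
```

Users at the top of this table are the ones who spend the largest share of their time in bed awake.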
sleep_rate$Id <- as.character(sleep_rate$Id)
less_sleep <- sleep_rate %>%
mutate(sleep_hours=TotalMinutesAsleep/60) %>%
filter(sleep_hours <8)
suff_sleep <- sleep_rate %>%
mutate(sleep_hours=TotalMinutesAsleep/60) %>%
filter(sleep_hours>8)
glimpse(suff_sleep)
## Rows: 114
## Columns: 7
## $ Id <chr> "1503960366", "1503960366", "1644430081", "18445050…
## $ Date <chr> "4/17/2016", "5/8/2016", "5/2/2016", "4/15/2016", "…
## $ Time <chr> "12:00:00", "12:00:00", "12:00:00", "12:00:00", "12…
## $ TotalSleepRecords <int> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 700, 594, 796, 644, 722, 590, 750, 503, 531, 545, 5…
## $ TotalTimeInBed <int> 712, 611, 961, 961, 961, 961, 775, 546, 565, 568, 5…
## $ sleep_hours <dbl> 11.666667, 9.900000, 13.266667, 10.733333, 12.03333…
ggplot(less_sleep, aes(x=Id)) +
geom_bar()+
labs(
x = "Users",
y = "No. of days",
title = "USERS GETTING INSUFFICIENT SLEEP")+
theme(axis.text.x = element_text(angle = 90))
ggplot(suff_sleep, aes(x=Id)) +
geom_bar()+
labs(
x = "Users",
y = "No. of days",
title = "USERS GETTING ADEQUATE SLEEP") +
theme(axis.text.x = element_text(angle = 90))
average_sleep <- sleep_rate %>%
group_by(Id) %>%
summarise(avg_sleep_hour=mean(TotalMinutesAsleep)/60) %>%
mutate(sleep_level=case_when(
avg_sleep_hour>=7 ~ "Adequate Sleep",
TRUE ~ "Inadequate Sleep"
))
# Count users per sleep level so each slice is sized by the number of users.
ggplot(count(average_sleep, sleep_level), aes(x=" ", y=n, fill=sleep_level))+
geom_col()+
coord_polar("y")+
labs(x=NULL, y=NULL, title="SLEEP RATE")+
theme(axis.text = element_blank(), legend.position = "top",
legend.title = element_blank())
More than 70% of the users are not getting adequate sleep of 7 hours.
weight_class<-weight_log %>%
group_by(Id) %>%
# Contiguous bands: values such as 24.95 or exactly 30 no longer fall through to NA.
mutate(weight_category=case_when(
BMI<18.5 ~ "Underweight",
BMI<25 ~ "Healthy",
BMI<30 ~ "Overweight",
TRUE ~ "Obese"
))
head(weight_class)
weight_class$Id <- as.character(weight_class$Id)
ggplot(weight_class, aes(x=" ", y=weight_category, fill=weight_category))+
geom_bar(stat="Identity")+
coord_polar("y")+
labs(x=NULL, y=NULL, title="WEIGHT CATEGORY", caption="Weight Data for 8 Users", colour="Category")+
theme(axis.text = element_blank(), legend.title = element_blank(), legend.position = "bottom")
About two-thirds of the users are either overweight or obese.
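# Summarise each table per user before joining; merging the raw tables pairs
# every activity row with every weight entry and inflates the step totals.
steps_by_user <- daily_activity %>%
group_by(Id) %>%
summarise(total_steps=sum(TotalSteps))
bmi_by_user <- weight_log %>%
group_by(Id) %>%
summarise(BMI=mean(BMI))
weight_comparison <- merge(steps_by_user, bmi_by_user, by="Id") %>%
arrange(total_steps)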
weight_comparison$Id <- as.character(weight_comparison$Id)
ggplot(weight_comparison, aes(x=total_steps, y=BMI, color=Id))+
geom_point()+
geom_bar(stat="identity")+
labs(x="Total Steps Taken", y="BMI", title="BMI vs Activity", caption="Weight Data for 8 Users")+
theme(legend.position = "bottom")
The available weight data cannot support any conclusive analysis, since the sample (8 users) is too small and too uncertain.
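One more caution applies when combining activity and weight records: both tables have several rows per user, so a join on `Id` alone is many-to-many and repeats each activity row once per weight entry. The sketch below shows the effect on toy data and one way around it; all names and values are hypothetical.

```r
library(dplyr)

# Toy data: one user with 3 activity days and 2 weight entries.
activity_demo <- data.frame(Id = c(1, 1, 1), TotalSteps = c(1000, 2000, 3000))
weight_demo <- data.frame(Id = c(1, 1), BMI = c(22.0, 23.0))

# Joining the raw tables on Id pairs every activity row with every weight row.
raw_join <- merge(activity_demo, weight_demo, by = "Id")
inflated_steps <- sum(raw_join$TotalSteps)  # each day is counted twice

# Summarising each table per user first keeps one row per Id on both sides.
steps_by_user <- activity_demo %>%
  group_by(Id) %>%
  summarise(total_steps = sum(TotalSteps))
bmi_by_user <- weight_demo %>%
  group_by(Id) %>%
  summarise(BMI = mean(BMI))
safe_join <- merge(steps_by_user, bmi_by_user, by = "Id")

print(c(inflated = inflated_steps, correct = safe_join$total_steps))
```

In this toy case the raw join reports 12000 steps for a user who actually took 6000, which would distort any comparison of total steps between users with different numbers of weight entries.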
Even though the chosen data set is neither reliable nor consistent, we can still offer the following observations and recommendations:
Recommendation : Even though tracker usage is consistent, few users record all the features: among the 33 users, only 18% recorded all three variables (steps, sleep, weight). Bellabeat should encourage its users to record every health variable by explaining the importance and benefits of doing so, thereby increasing use of its smart devices. Weekend events could be introduced to keep users engaged with the tracker every day.
Recommendation : Bellabeat could run a campaign built around reaching a distance milestone, to encourage users to cover more steps, burn more calories, and eventually maintain a healthy body.
Recommendation : To improve users’ sleep and help them fall asleep soon after going to bed, the fitness tracker could offer relaxing, deep-focus sounds that start playing when the user gets into bed and turn off once they fall asleep.
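The share of users who recorded steps, sleep, and weight, cited in the first recommendation, can be derived by intersecting the user Ids of the three tables. The following is a sketch with hypothetical Id vectors; on the real data the inputs would be `unique(daily_activity$Id)`, `unique(sleep_rate$Id)`, and `unique(weight_log$Id)`.

```r
# Hypothetical user Ids present in each table (illustrative only).
activity_ids <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sleep_ids <- c(1, 2, 3, 4, 5, 6, 7)
weight_ids <- c(2, 4, 11)

# Users who appear in all three tables.
all_three <- Reduce(intersect, list(activity_ids, sleep_ids, weight_ids))

# Share of all tracked users who recorded steps, sleep, and weight.
share_all_three <- length(all_three) / length(unique(activity_ids))

print(all_three)
print(share_all_three)
```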