This is a capstone project I have performed in RStudio being part of my Google Data Analytics Professional Certificate course.
You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.
Analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices then apply these insights to one of Bellabeat’s products.
The data is located here.
The data is of Fitbit Fitness Tracker stored in public domain in Kaggle. The author is Mobius. About 30 Fitbit users allowed collection of personal tracker data on their active time, sleep and heart rate for a period of one month (04/12/2016-05/12/2016). A total of 18 files are available in csv format.
Loading needed library
library('tidyverse')
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('here')
## here() starts at C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project
library('skimr')
library('janitor')
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library('DT')
Importing dataset
daily_activity <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/dailyActivity_merged.csv")
sleep_day <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/sleepDay_merged.csv")
heartrate_seconds <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/heartrate_seconds_merged.csv")
weight_log_info <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/weightLogInfo_merged.csv")
A brief display of the data table
datatable(head(daily_activity, 10), class = 'display',
options = list(pageLength = 5, dom = 'tip', scrollX = TRUE),
rownames = FALSE)
Convert data types as appropriate
daily_activity$ActivityDate <- mdy(daily_activity$ActivityDate)
sleep_day$SleepDay <- mdy_hms(sleep_day$SleepDay, tz=Sys.timezone())
heartrate_seconds$Time <- mdy_hms(heartrate_seconds$Time, tz=Sys.timezone())
Change column names to lower case, add underscore & ensure uniqueness
# Store the result in the existing table
daily_activity <- clean_names(daily_activity)
sleep_day <- clean_names(sleep_day)
heartrate_seconds <- clean_names(heartrate_seconds)
Confirm the numbers of unique IDs in each table
n_distinct(daily_activity$id)
## [1] 33
n_distinct(sleep_day$id)
## [1] 24
n_distinct(weight_log_info$Id)
## [1] 8
n_distinct(heartrate_seconds$id)
## [1] 14
Out of the overall participants which is 33, the weight and heart rate table has just 8 and 14 participants respectively. I am afraid that is too low.
# Distance per category
daily_activity %>%
select(daily_steps,
very_active_distance,
moderately_active_distance,
lightly_active_distance,
sedentary_distance) %>%
summary()
## daily_steps very_active_distance moderately_active_distance
## Min. : 0 Min. : 0.000 Min. :0.0000
## 1st Qu.: 3790 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 7406 Median : 0.210 Median :0.2400
## Mean : 7638 Mean : 1.503 Mean :0.5675
## 3rd Qu.:10727 3rd Qu.: 2.053 3rd Qu.:0.8000
## Max. :36019 Max. :21.920 Max. :6.4800
## lightly_active_distance sedentary_distance
## Min. : 0.000 Min. :0.000000
## 1st Qu.: 1.945 1st Qu.:0.000000
## Median : 3.365 Median :0.000000
## Mean : 3.341 Mean :0.001606
## 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :10.710 Max. :0.110000
Highest daily steps taken stands at 36,019 and and on average, it is 7,638
# Minutes per category
daily_activity %>%
select(very_active_minutes,
moderately_active_minutes,
lightly_active_minutes,
sedentary_minutes) %>%
summary()
## very_active_minutes moderately_active_minutes lightly_active_minutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
## Median : 4.00 Median : 6.00 Median :199.0
## Mean : 21.16 Mean : 13.56 Mean :192.8
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
## Max. :210.00 Max. :143.00 Max. :518.0
## sedentary_minutes
## Min. : 0.0
## 1st Qu.: 729.8
## Median :1057.5
## Mean : 991.2
## 3rd Qu.:1229.5
## Max. :1440.0
Average of categories of daily intensity(Distance)
# daily intensity distance
avg_intensity_levels <- daily_activity %>%
summarise(very_active_distance = mean(very_active_distance),
moderately_active_distance = mean(moderately_active_distance),
lightly_active_distance = mean(lightly_active_distance),
sedentary_distance = mean(sedentary_distance))
Pivot(convert) from wide to long
avg_intensity_levels_long <- avg_intensity_levels %>%
pivot_longer(cols = c('very_active_distance', 'moderately_active_distance', 'lightly_active_distance', 'sedentary_distance'),
names_to = 'avg_intensity_level',
values_to = 'kilometer')
#Add column for distance in meter
avg_intensity_levels_long <- mutate(avg_intensity_levels_long,
meters = kilometer*1000)
avg_intensity_levels_long
## # A tibble: 4 × 3
## avg_intensity_level kilometer meters
## <chr> <dbl> <dbl>
## 1 very_active_distance 1.50 1503.
## 2 moderately_active_distance 0.568 568.
## 3 lightly_active_distance 3.34 3341.
## 4 sedentary_distance 0.00161 1.61
ggplot(data = avg_intensity_levels_long, aes(x = avg_intensity_level, y = kilometer))+
geom_col(fill = '#56B4E9', color = 'black')+
xlab('Intensity distance categories')+
ylab('distance in km')+
labs(title = 'Average daily-intensity distance',
caption = 'FitBit data from Kaggle',
tag = 'Fig. 1') +
theme(plot.tag.position = 'bottomleft')
Average of categories of daily intensity(Minutes)
# daily intensity minutes
avg_intensity_mins <- daily_activity %>%
summarise(very_active_minutes = mean(very_active_minutes),
moderately_active_minutes = mean(moderately_active_minutes),
lightly_active_minutes = mean(lightly_active_minutes),
sedentary_minutes = mean(sedentary_minutes))
Pivot from wide to long
avg_active_mins_long <- avg_intensity_mins %>%
pivot_longer(cols = c('very_active_minutes', 'moderately_active_minutes', 'lightly_active_minutes', 'sedentary_minutes'),
names_to = 'avg_active_times',
values_to = 'minutes')
# Add column for time in hours
avg_active_mins_long <- mutate(avg_active_mins_long, hours = minutes/60)
avg_active_mins_long
## # A tibble: 4 × 3
## avg_active_times minutes hours
## <chr> <dbl> <dbl>
## 1 very_active_minutes 21.2 0.353
## 2 moderately_active_minutes 13.6 0.226
## 3 lightly_active_minutes 193. 3.21
## 4 sedentary_minutes 991. 16.5
Users spent, on average, just 21 minutes very active everyday and 9991 minutes which is equal to over 16 hours being Inactive(sedentary). That is a lot of hours spent being inactive.
Data Viz
ggplot(data = avg_active_mins_long, aes(x = avg_active_times, y = hours))+
geom_col(fill = '#0072B2', color = 'black')+
xlab('Time-spent categories')+
ylab('minutes/hour')+
labs(title = 'Average daily-intensity minutes',
caption = 'FitBit data from Kaggle',
tag = 'Fig. 2') +
theme(plot.tag.position = 'bottomleft')
Average distance/minutes
distance_per_mins <- avg_intensity_levels_long$meters/avg_active_mins_long$minutes
distance_per_mins
## [1] 70.998743400 41.839071311 17.326752884 0.001620627
Users cover about 71m in 1 very active minute, about 42m in 1 moderately active minute and just 17m in 1 lightly active minute
Data Viz
barplot(distance_per_mins,
names = c('very_active', 'moderate', 'light', 'sedentary'),
col="dodgerblue3",
main = 'Average distance per minutes',
ylab = 'Meters',
xlab = 'Minutes',
sub = 'Fig. 3')
Calories versus daily steps
#Relationship between calories burnt and daily steps per week
daily_activity <- daily_activity %>%
mutate(week = case_when(
between(activity_date, as.Date('2016-04-12'), as.Date('2016-04-18'))~ 'week_1',
between(activity_date, as.Date('2016-04-19'), as.Date('2016-04-25'))~ 'week_2',
between(activity_date, as.Date('2016-04-26'), as.Date('2016-05-02'))~ 'week_3',
between(activity_date, as.Date('2016-05-03'), as.Date('2016-05-09'))~ 'week_4',
TRUE~ 'NA'))
calories_per_step <- daily_activity %>%
select(activity_date, daily_steps, calories, week) %>%
filter(week == 'week_1'|week == 'week_2'|
week == 'week_3'|week == 'week_4')
Data Viz
#Calories burned per steps taken daily
ggplot(data = calories_per_step, aes(x = daily_steps, y = calories))+
geom_point(color = 'blue')+
geom_smooth(color = 'red')+
facet_wrap(~week)+
xlab('Average daily steps')+
ylab('Average kcals')+
labs(title = 'Average calories burned/steps taken daily',
caption = 'FitBit data from Kaggle',
tag = 'Fig. 4') +
theme(plot.tag.position = 'bottomleft')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There is a positive correlation between steps taken daily and calories burned for the data shows that as the daily steps increases, the calories burned increases. Even though the trend line seems to be moving downward towards the peak of the 3rd week, the variable that caused this could be seen as an outlier. There is also a shift during the middle of the 4th week but the trend still continued to move upward. Each week may not follow exact pattern but the overall trend is clearly from the lower left to the upper right of the plot.
Average heart rate versus average sleep time
#Join average heart rate and sleep minutes
heart_rate_sleep <- avg_heart_rate %>%
full_join(avg_sleep_mins, by = 'id')
Data Viz
ggplot(data = heart_rate_sleep, aes(x = average_sleep_mins, y = average_heart_rate))+
geom_point(color = 'blue')+
labs(title = 'Average heart rate vs sleep time',
caption = 'FitBit data from Kaggle',
tag = 'Fig. 6')+
theme(plot.tag.position = 'bottomleft')+
xlab('Average sleep minutes--->')+
ylab('Average heart rate--->')
The data points are scattered and it shows there is no correlation between average sleep time and heart rate but this could be due to insufficient data to form a basis for conclusion.
Relationship between minutes asleep and time in bed
ggplot(data = sleep_day, aes(x = total_time_in_bed, y = total_minutes_asleep))+
geom_point(color = 'blue')+
geom_smooth(color = 'red')+
labs(title = 'Time asleep vs time in bed',
caption = 'FitBit data from Kaggle',
tag = 'Fig. 5')+
theme(plot.tag.position = 'bottomleft')+
xlab('minutes in bed--->')+
ylab('sleep minutes--->')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There is a positive correlation between time spent in bed and time spent asleep. As the minutes spent in bed increases, the higher the time spent sleeping.
Bellabeat should collect data from their app and other products and perform a detailed analysis of the users behavior