Bellabeat is a tech-driven wellness company for women, founded in 2013, that manufactures beautifully designed, health-focused smart products to inform and inspire women around the world. Bellabeat technology collects data on activity, sleep, stress, and reproductive health, empowering women with knowledge about their own health and habits.
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
As a junior data analyst on the marketing analytics team at Bellabeat, I have been asked to analyze usage data from non-Bellabeat (FitBit) smart devices to gain insight into how consumers use these devices, and then apply these insights to one of Bellabeat’s products to help guide the company's marketing strategy.
Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
Sandro Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team
Marketing analytics team: A team of data analysts responsible for guiding Bellabeat’s marketing strategy.
The FitBit Fitness Tracker Data (CC0: Public Domain) used for this project is a public dataset that explores smart device users’ daily habits, made available through Mobius.
These datasets were generated by thirty consenting Fitbit users from a survey distributed via Amazon Mechanical Turk between 03.12.2016-05.12.2016. For this analysis the datasets containing personal tracker data for daily physical activity, weight_info, and sleep monitoring were used.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate) #for mdy()
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(janitor) #for clean_names()
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("corrplot") #for plotting cor matrix
## corrplot 0.90 loaded
library(stats) #for cor()
library("skimr") # for skim() summaries
library(ggpubr) # for pie chart
daily_activity <- read_csv("D:/ugwun/Documents/R projects/fitbit_data/dailyActivity_merged.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Id = col_double(),
## ActivityDate = col_character(),
## TotalSteps = col_double(),
## TotalDistance = col_double(),
## TrackerDistance = col_double(),
## LoggedActivitiesDistance = col_double(),
## VeryActiveDistance = col_double(),
## ModeratelyActiveDistance = col_double(),
## LightActiveDistance = col_double(),
## SedentaryActiveDistance = col_double(),
## VeryActiveMinutes = col_double(),
## FairlyActiveMinutes = col_double(),
## LightlyActiveMinutes = col_double(),
## SedentaryMinutes = col_double(),
## Calories = col_double()
## )
weight_data <- read_csv("D:/ugwun/Documents/R projects/fitbit_data/weightLogInfo_merged.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Id = col_double(),
## Date = col_character(),
## WeightKg = col_double(),
## WeightPounds = col_double(),
## Fat = col_double(),
## BMI = col_double(),
## IsManualReport = col_logical(),
## LogId = col_double()
## )
sleep_data <- read_csv("D:/ugwun/Documents/R projects/fitbit_data/sleepDay_merged.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Id = col_double(),
## SleepDay = col_character(),
## TotalSleepRecords = col_double(),
## TotalMinutesAsleep = col_double(),
## TotalTimeInBed = col_double()
## )
#Take a glimpse at the daily activity data
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
head(daily_activity)
## # A tibble: 6 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
#Take a glimpse at the sleep data
colnames(sleep_data)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
glimpse(sleep_data)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~
head(sleep_data)
## # A tibble: 6 x 5
## Id SleepDay TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 12:00:0~ 1 327 346
## 2 1.50e9 4/13/2016 12:00:0~ 2 384 407
## 3 1.50e9 4/15/2016 12:00:0~ 1 412 442
## 4 1.50e9 4/16/2016 12:00:0~ 2 340 367
## 5 1.50e9 4/17/2016 12:00:0~ 1 700 712
## 6 1.50e9 4/19/2016 12:00:0~ 1 304 320
#Take a glimpse at the weight data
colnames(weight_data)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
glimpse(weight_data)
## Rows: 67
## Columns: 8
## $ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212~
## $ Date <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2~
## $ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, ~
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6~
## $ Fat <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,~
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ~
## $ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,~
head(weight_data)
## # A tibble: 6 x 8
## Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1.50e9 5/2/2016 ~ 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1.50e9 5/3/2016 ~ 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1.93e9 4/13/2016~ 134. 294. NA 47.5 FALSE 1.46e12
## 4 2.87e9 4/21/2016~ 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2.87e9 5/12/2016~ 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4.32e9 4/17/2016~ 72.4 160. 25 27.5 TRUE 1.46e12
# How many observations are in each dataset
nrow(daily_activity)
## [1] 940
nrow(sleep_data)
## [1] 413
nrow(weight_data)
## [1] 67
# How many unique IDs are in each dataset
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_data$Id)
## [1] 24
n_distinct(weight_data$Id)
## [1] 8
# How many duplicate rows are in the dataset
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(sleep_data))
## [1] 3
sum(duplicated(weight_data))
## [1] 0
From a quick scan of the loaded datasets, the following observations were made:
1. The Id column is common to all 3 datasets and can be used to merge them.
2. The date variables in the 3 datasets (i.e. daily_activity$ActivityDate, sleep_data$SleepDay, and weight_data$Date) are currently character variables and need to be converted to Date format.
3. The sleep_data and weight_data have date and time merged in one column; these need to be separated, as only the date will be used for the analysis.
4. Only 24 and 8 unique users logged sleep data and weight data respectively, compared to 33 unique users who logged their daily activities. This implies that most users used the device to log daily activities, but not all of them tracked their weight and sleeping habits with it.
5. There are no duplicate rows in daily_activity or weight_data; however, sleep_data has 3 duplicate rows, which need to be removed.
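The date handling described in these observations can be sketched on two sample strings copied from the SleepDay and Date columns printed above. This is only an illustration, assuming lubridate's mdy_hms() and as_date() are applied to the full timestamp; the names raw_stamps, parsed and dates_only are hypothetical.

```r
library(lubridate)

# Sample strings copied from the SleepDay and Date columns shown above
raw_stamps <- c("4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM")

# Parse the full month/day/year timestamp, then keep only the calendar date
parsed <- mdy_hms(raw_stamps)
dates_only <- as_date(parsed)

print(dates_only)
```

Parsing the whole timestamp in one step avoids splitting the string by hand; the time part is simply discarded when converting to a date.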
# Clean column names to lower case
daily_activity <- clean_names(daily_activity)
sleep_data <- clean_names(sleep_data)
weight_data <- clean_names(weight_data)
# Change the activity_date column name to 'date' in the daily_activity dataset
daily_activity <- daily_activity %>%
dplyr::rename(date = activity_date)
# Examine the column names
colnames(daily_activity)
## [1] "id" "date"
## [3] "total_steps" "total_distance"
## [5] "tracker_distance" "logged_activities_distance"
## [7] "very_active_distance" "moderately_active_distance"
## [9] "light_active_distance" "sedentary_active_distance"
## [11] "very_active_minutes" "fairly_active_minutes"
## [13] "lightly_active_minutes" "sedentary_minutes"
## [15] "calories"
colnames(sleep_data)
## [1] "id" "sleep_day" "total_sleep_records"
## [4] "total_minutes_asleep" "total_time_in_bed"
colnames(weight_data)
## [1] "id" "date" "weight_kg" "weight_pounds"
## [5] "fat" "bmi" "is_manual_report" "log_id"
# Removing duplicate rows from the sleep_data
sleep_data <- distinct(sleep_data)
# Confirm that the duplicate was removed
sum(duplicated(sleep_data))
## [1] 0
# find missing values
sum(is.na(daily_activity))
## [1] 0
sum(is.na(sleep_data))
## [1] 0
sum(is.na(weight_data))
## [1] 65
# There are 65 missing values in the fat column of the weight_data
# remove the 'fat' column with the missing data in the weight_data and the log_id column
weight_data<- select(weight_data, -fat)
weight_data<- select(weight_data, -log_id)
# Examine the weight_data
head(weight_data)
## # A tibble: 6 x 6
## id date weight_kg weight_pounds bmi is_manual_report
## <dbl> <chr> <dbl> <dbl> <dbl> <lgl>
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 116. 22.6 TRUE
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 116. 22.6 TRUE
## 3 1927972279 4/13/2016 1:08:52 AM 134. 294. 47.5 FALSE
## 4 2873212765 4/21/2016 11:59:59 ~ 56.7 125. 21.5 TRUE
## 5 2873212765 5/12/2016 11:59:59 ~ 57.3 126. 21.7 TRUE
## 6 4319703577 4/17/2016 11:59:59 ~ 72.4 160. 27.5 TRUE
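The missing-value check above gives one total for the whole table. A per-column count shows where the NAs sit. The sketch below uses a toy data frame (hypothetical name toy_weight) built from the six rows printed by the first head(weight_data) call, so its counts describe only those six rows, not the full dataset.

```r
# Toy stand-in for weight_data, using values from the six rows printed earlier
toy_weight <- data.frame(
  id  = c(1503960366, 1503960366, 1927972279, 2873212765, 2873212765, 4319703577),
  fat = c(22, NA, NA, NA, NA, 25),
  bmi = c(22.65, 22.65, 47.54, 21.45, 21.69, 27.45)
)

# Count missing values column by column instead of across the whole table
na_per_column <- colSums(is.na(toy_weight))
print(na_per_column)
```

Run on the real weight_data before the fat column is dropped, the same call would confirm whether all missing values are confined to that column.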
# Convert the data type of the date column from character variable to date variable
daily_activity$date <- lubridate::mdy(daily_activity$date)
daily_activity <- mutate(daily_activity, weekday = weekdays(date))
# confirm that the data type is changed from character to date
head(daily_activity)
## # A tibble: 6 x 16
## id date total_steps total_distance tracker_distance logged_activiti~
## <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 13162 8.5 8.5 0
## 2 1.50e9 2016-04-13 10735 6.97 6.97 0
## 3 1.50e9 2016-04-14 10460 6.74 6.74 0
## 4 1.50e9 2016-04-15 9762 6.28 6.28 0
## 5 1.50e9 2016-04-16 12669 8.16 8.16 0
## 6 1.50e9 2016-04-17 9705 6.48 6.48 0
## # ... with 10 more variables: very_active_distance <dbl>,
## # moderately_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>, weekday <chr>
# sleep_data cleaning: separate sleep_day column to date and time column, convert the date from character variable to date format, and add weekdays column
sleep_data <- sleep_data %>%
  # sleep_day looks like "4/12/2016 12:00:00 AM"; extra = "drop" discards the AM/PM piece without a warning
  separate(sleep_day, c("date", "time"), sep = " ", extra = "drop") %>%
  mutate(date = mdy(date), weekday = weekdays(date)) %>%
  select(-time)
# Order the weekday factor levels from Monday to Sunday
sleep_data$weekday <- factor(sleep_data$weekday,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))
# confirm that the data type is changed from character to date format
head(sleep_data)
## # A tibble: 6 x 6
## id date total_sleep_reco~ total_minutes_a~ total_time_in_b~ weekday
## <dbl> <date> <dbl> <dbl> <dbl> <fct>
## 1 1.50e9 2016-04-12 1 327 346 Tuesday
## 2 1.50e9 2016-04-13 2 384 407 Wednes~
## 3 1.50e9 2016-04-15 1 412 442 Friday
## 4 1.50e9 2016-04-16 2 340 367 Saturd~
## 5 1.50e9 2016-04-17 1 700 712 Sunday
## 6 1.50e9 2016-04-19 1 304 320 Tuesday
# weight_data cleaning: separate date column to date and time column, convert the date from character variable to date format
weight_data <- weight_data %>%
  # extra = "drop" discards the AM/PM piece without a warning
  separate(date, c("date", "time"), sep = " ", extra = "drop") %>%
  select(-time) %>%
  mutate(date = mdy(date), weekday = weekdays(date))
# confirm that the data type is changed from character to date
head(weight_data)
## # A tibble: 6 x 7
## id date weight_kg weight_pounds bmi is_manual_report weekday
## <dbl> <date> <dbl> <dbl> <dbl> <lgl> <chr>
## 1 1503960366 2016-05-02 52.6 116. 22.6 TRUE Monday
## 2 1503960366 2016-05-03 52.6 116. 22.6 TRUE Tuesday
## 3 1927972279 2016-04-13 134. 294. 47.5 FALSE Wednesday
## 4 2873212765 2016-04-21 56.7 125. 21.5 TRUE Thursday
## 5 2873212765 2016-05-12 57.3 126. 21.7 TRUE Thursday
## 6 4319703577 2016-04-17 72.4 160. 27.5 TRUE Sunday
# Add a Sleep_quality column that categorizes minutes asleep as sleep deprived, adequate, or excessive (per CDC recommendation)
sleep_data <- sleep_data %>%
  mutate(Sleep_quality = case_when(
    total_minutes_asleep < 420 ~ "sleep deprived",
    total_minutes_asleep >= 420 &
      total_minutes_asleep <= 540 ~ "adequate sleep",
    total_minutes_asleep > 540 ~ "excessive sleep"))
# Take a look at the dataframe
head(sleep_data)
## # A tibble: 6 x 7
## id date total_sleep_reco~ total_minutes_a~ total_time_in_b~ weekday
## <dbl> <date> <dbl> <dbl> <dbl> <fct>
## 1 1.50e9 2016-04-12 1 327 346 Tuesday
## 2 1.50e9 2016-04-13 2 384 407 Wednes~
## 3 1.50e9 2016-04-15 1 412 442 Friday
## 4 1.50e9 2016-04-16 2 340 367 Saturd~
## 5 1.50e9 2016-04-17 1 700 712 Sunday
## 6 1.50e9 2016-04-19 1 304 320 Tuesday
## # ... with 1 more variable: Sleep_quality <chr>
# Add a weight_status column based on BMI; conditions are checked in order, so no BMI value falls between bands
weight_data <- weight_data %>%
  mutate(weight_status = case_when(
    bmi < 18.5 ~ "underweight",
    bmi < 25.0 ~ "healthy weight",  # 18.5 to below 25
    bmi < 30.0 ~ "overweight",      # 25 to below 30
    bmi >= 30.0 ~ "obesity"))
# take a look at the data frame
head(weight_data)
## # A tibble: 6 x 8
## id date weight_kg weight_pounds bmi is_manual_report weekday
## <dbl> <date> <dbl> <dbl> <dbl> <lgl> <chr>
## 1 1503960366 2016-05-02 52.6 116. 22.6 TRUE Monday
## 2 1503960366 2016-05-03 52.6 116. 22.6 TRUE Tuesday
## 3 1927972279 2016-04-13 134. 294. 47.5 FALSE Wednesday
## 4 2873212765 2016-04-21 56.7 125. 21.5 TRUE Thursday
## 5 2873212765 2016-05-12 57.3 126. 21.7 TRUE Thursday
## 6 4319703577 2016-04-17 72.4 160. 27.5 TRUE Sunday
## # ... with 1 more variable: weight_status <chr>
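BMI bands are easiest to get right with half-open intervals, so that boundary values such as exactly 30.0 land in one band only. The sketch below shows one way to do this with base R's cut(); the test values and the names bmi_values and bmi_band are hypothetical, and the band edges are the usual adult BMI cut-offs (18.5, 25, 30).

```r
# Hypothetical BMI values, including boundary cases
bmi_values <- c(17.0, 18.5, 24.95, 25.0, 29.95, 30.0, 47.54)

# right = FALSE makes each interval closed on the left: [18.5, 25), [25, 30), ...
bmi_band <- cut(
  bmi_values,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "healthy weight", "overweight", "obesity"),
  right = FALSE
)

print(data.frame(bmi = bmi_values, weight_status = bmi_band))
```

Because the intervals share their edges, every finite BMI value receives exactly one label.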
daily_activity %>%
select(total_steps,
total_distance,
sedentary_minutes) %>%
summary()
## total_steps total_distance sedentary_minutes
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :1440.0
Assumptions:
1. Daily usage was calculated from daily steps logged.
2. Zero daily steps (steps = 0) is considered NO USAGE.
3. Daily steps greater than zero (steps > 0) are considered ACTIVE USAGE.
4. Active_days was defined as the total days of ACTIVE USAGE (steps > 0).
5. Total_days was defined as the total observed days.
6. Usage rate was defined as active_days / total_days.
# Creating dataframe for daily usage
daily_usage_df <- daily_activity %>%
group_by(id) %>%
summarise( active_days = sum(total_steps != 0),
total_days = sum(total_steps >= 0),
usage_rate = active_days / total_days)
#checking summary of usage rate
daily_usage_df%>%
select(-id)%>%
summary()
## active_days total_days usage_rate
## Min. : 3.00 Min. : 4.00 Min. :0.5484
## 1st Qu.:21.00 1st Qu.:29.00 1st Qu.:0.9231
## Median :30.00 Median :31.00 Median :1.0000
## Mean :26.15 Mean :28.48 Mean :0.9138
## 3rd Qu.:31.00 3rd Qu.:31.00 3rd Qu.:1.0000
## Max. :31.00 Max. :31.00 Max. :1.0000
# grouping users by usage rate
daily_usage_df <- daily_usage_df %>%
mutate( user_type = case_when(
usage_rate == 1 ~ "perfect user",
usage_rate < 1 & usage_rate >= 0.8 ~ "active user",
usage_rate < 0.8 & usage_rate >= 0.5 ~ "average user",
usage_rate < 0.5 ~ "casual user"))
# Creating a data frame for plotting different category of users
daily_usage_plot <- daily_usage_df %>%
group_by(user_type) %>%
summarise(user_count = n()) %>%
mutate(perc_usage = (round(user_count / sum(user_count)*100,0)))
# assign and define color palette
mycols <- c("azure4", "#BFC9CA", "#FADBD8")
# paste percentage sign to calculated value
labs <- paste0(daily_usage_plot$perc_usage, "%")
# Plot pie chart distribution of users
ggpie(daily_usage_plot, "user_count", label = labs,
fill = "user_type", color = "white",
palette = (mycols))
Assumptions:
1. Daily usage was calculated from daily steps logged.
2. Zero daily steps (steps = 0) is considered NO USAGE.
3. Daily steps greater than zero (steps > 0) are considered ACTIVE USAGE.
4. Active_user was defined as the total users with steps > 0.
5. Total_user was defined as the total observed users.
6. Usage rate was defined as active_user / total_user.
Usage rate was slightly higher on Fridays than on other days of the week.
# Creating data frame
daily_usage_weekday <- daily_activity %>%
group_by(weekday) %>%
summarise( active_user = sum(total_steps != 0),
total_user = sum(total_steps >= 0),
usage_rate = active_user / total_user)
#Plotting daily usage per day
ggplot(data=daily_usage_weekday, aes(x=weekday, y=usage_rate)) +
geom_bar(stat="identity", width=0.5, fill = "#FADBD8") +
#zoom in y-axis to "0.8 ~ 1"
coord_cartesian(ylim = c(0.8, 1))+
#main title text
ggtitle("Usage Rate By Weekdays") +
# x, y axis label text
xlab("Weekdays") + ylab("Usage Rate") +
#text settings
theme(
plot.title = element_text(color="black", size=24, face="bold"),
axis.title.x = element_text(color="black", size=18, face="bold"),
axis.title.y = element_text(color="black", size=18, face="bold"))
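When weekday is stored as a character column, ggplot2 orders the x-axis alphabetically (Friday, Monday, ...). A small sketch of putting the days in calendar order before plotting is shown below. The data frame toy_usage and its usage rates are hypothetical, and the day names assume weekdays() ran in an English locale.

```r
# Calendar order for the x-axis
weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

# Hypothetical per-weekday usage rates, listed in alphabetical order
toy_usage <- data.frame(
  weekday = c("Friday", "Monday", "Saturday", "Sunday",
              "Thursday", "Tuesday", "Wednesday"),
  usage_rate = c(0.94, 0.92, 0.91, 0.90, 0.92, 0.93, 0.92)
)

# Convert to a factor with explicit levels so plots and sorts follow the calendar
toy_usage$weekday <- factor(toy_usage$weekday, levels = weekday_levels)
toy_usage <- toy_usage[order(toy_usage$weekday), ]

print(toy_usage)
```

The same factor() call applied to daily_usage_weekday$weekday would make the bar chart read Monday through Sunday.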
# Creating data frame
usage_activity <- daily_activity %>%
group_by(id) %>%
summarise( active_user = sum(total_steps != 0),
total_user = sum(total_steps >= 0),
usage_rate = active_user / total_user,
avg_sedentary_minutes = mean(sedentary_minutes),
avg_calories_burned = mean(calories),
avg_steps = mean(total_steps))
# Relationship between usage rate and calories burned
ggscatter(usage_activity, x = "usage_rate", y = "avg_calories_burned",
color = "black", shape = 21, size = 3, # Points color, shape and size
add = "reg.line", # Add regression line
add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
conf.int = TRUE, # Add confidence interval
cor.coef = TRUE # Add correlation coefficient
)
## `geom_smooth()` using formula 'y ~ x'
# Relationship between usage rate and total steps
ggscatter(usage_activity, x = "usage_rate", y = "avg_steps",
color = "black", shape = 21, size = 3, # Points color, shape and size
add = "reg.line", # Add regression line
add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
conf.int = TRUE, # Add confidence interval
cor.coef = TRUE # Add correlation coefficient
)
## `geom_smooth()` using formula 'y ~ x'
# Relationship between usage rate and sedentary time
ggscatter(usage_activity, x = "usage_rate", y = "avg_sedentary_minutes",
color = "black", shape = 21, size = 3, # Points color, shape and size
add = "reg.line", # Add regression line
add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
conf.int = TRUE, # Add confidence interval
cor.coef = TRUE # Add correlation coefficient
)
## `geom_smooth()` using formula 'y ~ x'
Most participants spent most of their time without any form of activity (high sedentary minutes), followed by light activity. Tuesday appears to have the most active minutes overall.
# Creating a data frame to summarize the activity levels by week day
active_minutes <- daily_activity %>%
select(weekday, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes) %>%
group_by(weekday) %>%
summarize(very_active = sum(very_active_minutes),
fairly_active = sum(fairly_active_minutes),
lightly_active = sum(lightly_active_minutes),
sedentary= sum(sedentary_minutes))
#order the factor levels to follow a set sequence
active_minutes$weekday <- factor(active_minutes$weekday,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))
# Convert to long data for easy plotting
active_minutes_long <- pivot_longer(active_minutes,
cols = very_active:sedentary,
names_to = "activity_levels",
values_to = "total_minutes")
# Plotting activity intensity levels by weekday
ggplot(active_minutes_long, aes(x = weekday, y = total_minutes)) +
geom_col(aes(fill = activity_levels), position = position_dodge2(preserve = "single")) +
labs(title = "Active Minutes by Intensity Per Day",
x = "Day of the Week",
y = "Total Active Minutes",
fill = "Level of Intensity") +
theme(axis.text.x=element_text(angle=45,hjust=1))
# Expanding on the activity levels without the sedentary time
active_minutes_2 <- daily_activity %>%
select(weekday, very_active_minutes, fairly_active_minutes, lightly_active_minutes) %>%
group_by(weekday) %>%
summarize(very_active = sum(very_active_minutes),
fairly_active = sum(fairly_active_minutes),
lightly_active = sum(lightly_active_minutes))
# Order the factor levels and reshape active_minutes_2 (not active_minutes, which still holds the sedentary column)
active_minutes_2$weekday <- factor(active_minutes_2$weekday,
                                   levels = c("Monday", "Tuesday", "Wednesday",
                                              "Thursday", "Friday", "Saturday",
                                              "Sunday"))
active_minutes_2_long <- pivot_longer(active_minutes_2,
                                      cols = very_active:lightly_active,
                                      names_to = "activity_levels",
                                      values_to = "total_minutes")
ggplot(active_minutes_2_long, aes(x = weekday, y = total_minutes)) +
geom_col(aes(fill = activity_levels), position = position_dodge2(preserve = "single")) +
labs(title = "Active Minutes by Intensity Per Day",
x = "Day of the Week",
y = "Total Active Minutes",
fill = "Level of Intensity") +
theme(axis.text.x=element_text(angle=45,hjust=1))
Although the sleep data is limited, for the smart device users who recorded their sleep there is no significant correlation between sleep quality and the levels of activity intensity, with the exception of sedentary minutes, which were negatively correlated with sleep quality (R = -0.6). This suggests that more sedentary time is associated with sleep deprivation.
# Create a data frame by joining sleep data and activity data, then select the columns needed and filter out rows with NA
Sleep_daily_activity <- left_join(daily_activity, sleep_data, by = c("id", "date"))
sleep_activity_df <- Sleep_daily_activity %>%
select(date, very_active_minutes, fairly_active_minutes, lightly_active_minutes,
sedentary_minutes, total_minutes_asleep, Sleep_quality)
sleep_activity_df <- sleep_activity_df %>%
filter(!is.na(total_minutes_asleep)) %>%
filter(!is.na(Sleep_quality))
# Plotting activity levels vs sleep quality
sleep_activity_df %>%
ggplot(mapping = aes(x = total_minutes_asleep, y = very_active_minutes)) +
geom_point(aes(color = Sleep_quality)) +
scale_color_brewer(palette = "Set1") +
stat_cor(aes(color = Sleep_quality), label.x = 3)
sleep_activity_df %>%
ggplot(mapping = aes(x = total_minutes_asleep, y = fairly_active_minutes)) +
geom_point(aes(color = Sleep_quality)) +
scale_color_brewer(palette = "Set1") +
stat_cor(aes(color = Sleep_quality), label.x = 3)
sleep_activity_df %>%
ggplot(mapping = aes(x = total_minutes_asleep, y = lightly_active_minutes)) +
geom_point(aes(color = Sleep_quality)) +
scale_color_brewer(palette = "Set1") +
stat_cor(aes(color = Sleep_quality), label.x = 3)
sleep_activity_df %>%
ggplot(mapping = aes(x = total_minutes_asleep, y = sedentary_minutes)) +
geom_point(aes(color = Sleep_quality)) +
scale_color_brewer(palette = "Set1") +
stat_cor(aes(color = Sleep_quality), label.x = 3)
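The correlation labels drawn by stat_cor() can also be checked numerically. The sketch below runs a Pearson test with base R's cor.test() on a toy data frame; the values in toy are hypothetical and were chosen only to show a negative relationship, so the resulting numbers say nothing about the real dataset.

```r
# Hypothetical sedentary and sleep minutes for eight user-days
toy <- data.frame(
  sedentary_minutes    = c(600, 650, 700, 750, 800, 850, 900, 950),
  total_minutes_asleep = c(480, 470, 450, 455, 420, 400, 390, 360)
)

# Pearson correlation with a significance test
result <- cor.test(toy$sedentary_minutes, toy$total_minutes_asleep,
                   method = "pearson")

r_value <- unname(result$estimate)  # correlation coefficient
p_value <- result$p.value           # p-value for H0: no correlation

cat("r =", round(r_value, 2), " p =", signif(p_value, 2), "\n")
```

Replacing toy with sleep_activity_df would give the coefficient and p-value behind the plot labels, which helps when judging whether a correlation is significant.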
There is a strong positive correlation between time spent in bed and time asleep. That is, users who spend more time in bed get more sleep.
ggplot(data=sleep_data, aes(x=total_minutes_asleep, y=total_time_in_bed)) +
geom_point(aes(color = Sleep_quality))+
scale_color_brewer(palette = "Set2") +
stat_cor()
Weekends (Saturdays and Sundays) had the highest number of excessive-sleep records, while Wednesdays had the highest number of adequate-sleep records. Tuesdays recorded the highest level of sleep deprivation.
sleep_qual_day <- sleep_data %>%
select(weekday, Sleep_quality) %>%
group_by(weekday, Sleep_quality) %>%
tally()
ggplot(sleep_qual_day, aes(x = weekday, y = n)) +
geom_col(aes(fill = Sleep_quality))+
labs(title = "Sleep Quality Per Day",
x = "Day of the Week",
y = "Number of Sleep Records",
fill = "Sleep Quality") +
theme(axis.text.x=element_text(angle=45,hjust=1))
There is no significant correlation between weight and sleep quality. However, the one user whose weight was recorded as obese was also sleep deprived. This could be significant, but the data is not sufficient to draw a reasonable conclusion.
all_activity_log <- left_join(Sleep_daily_activity, weight_data, by = c("id", "date"))
# weight and sleep
weight_sleep_df <- all_activity_log %>%
select(date, bmi, total_minutes_asleep, Sleep_quality, weight_status)
head(weight_sleep_df)
## # A tibble: 6 x 5
## date bmi total_minutes_asleep Sleep_quality weight_status
## <date> <dbl> <dbl> <chr> <chr>
## 1 2016-04-12 NA 327 sleep deprived <NA>
## 2 2016-04-13 NA 384 sleep deprived <NA>
## 3 2016-04-14 NA NA <NA> <NA>
## 4 2016-04-15 NA 412 sleep deprived <NA>
## 5 2016-04-16 NA 340 sleep deprived <NA>
## 6 2016-04-17 NA 700 excessive sleep <NA>
weight_sleep_df %>%
filter(!is.na(total_minutes_asleep)) %>%
filter(!is.na(Sleep_quality)) %>%
filter(!is.na(weight_status)) %>%
filter(!is.na(bmi)) %>%
ggplot(mapping = aes(x = total_minutes_asleep, y = bmi)) +
# map weight_status to point shape; left unnamed, it would be taken as the x aesthetic
geom_point(aes(color = Sleep_quality, shape = weight_status), size = 5) +
labs(title = "Sleep vs Weight", x = "Total Minutes Asleep",
y = "BMI") +
scale_color_brewer(palette = "Set1")+
stat_cor()
There appears to be no correlation between the number of calories burned in a day and the amount of time spent asleep.
#Calories and sleep
sleep_calories_df <- all_activity_log %>%
select(calories, total_minutes_asleep, Sleep_quality)%>%
filter(!is.na(total_minutes_asleep)) %>%
filter(!is.na(Sleep_quality))
ggplot(data = sleep_calories_df,aes(x=total_minutes_asleep, y=calories)) +
#add transparency (alpha) to avoid over plotting
geom_point(alpha = 0.5, aes(color = Sleep_quality)) +
#add text labels
labs(title="Time Asleep vs. Calories",
x = "Total Minutes Asleep",
y = "Calories Burned") +
stat_cor()
Obese users spend most of their time sedentary and little time on lightly active activities. Overweight users also spend most of their time sedentary and, interestingly, have the highest time spent on very active activities.
# Creating data frame for weight and activity levels
weight_activity <- all_activity_log %>%
select(id, bmi, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes, weight_status) %>%
filter(!is.na(bmi))%>%
filter(!is.na(weight_status))%>%
group_by(id)
weight_activity_long <- pivot_longer(weight_activity,
cols = very_active_minutes:sedentary_minutes,
names_to = "activity_levels",
values_to = "total_minutes")
ggplot(weight_activity_long, aes(x = weight_status, y = total_minutes)) +
geom_col(aes(fill = activity_levels), position = 'dodge') +
labs(title = "Active Minutes by Intensity and Weight Status",
x = "Weight Status",
y = "Total Minutes",
fill = "Level of Intensity") +
theme(axis.text.x=element_text(angle=45,hjust=1))
The analysis shows that most users of the FitBit smart devices record their daily physical activity more than their sleep and weight. It is not clear which devices the data came from, but it would be interesting to find out whether users fail to record sleep data because they take their device off while sleeping (for example, people may be uncomfortable sleeping with a wrist watch or bracelet). A viable solution would be to redesign and sync the app to track sleep even when users are not wearing their devices (the Samsung Health app already does this).
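The gap between activity, sleep and weight logging can be put in numbers using the unique-user counts reported by n_distinct() earlier (33, 24 and 8). A short sketch follows; the names users and share are hypothetical.

```r
# Unique users per dataset, taken from the n_distinct() output above
users <- c(activity = 33, sleep = 24, weight = 8)

# Share of activity-logging users who also logged sleep or weight, in percent
share <- round(users / users[["activity"]] * 100, 1)

print(share)
```

About 73% of the users logged sleep and about 24% logged weight.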
The marketing team should engage users in healthy habits by prompting and recommending activities when users have been inactive for some time.
Sending users insights and analysis of their weekly activity and sleep records, e.g. how their activity levels for the week could be affecting their sleep, could also help them become aware of their health status and may prompt them to adopt a healthier lifestyle.
Small user sample: The dataset is small and, in the case of the sleep data, incomplete, so it is not a true representation of the population. As a result, inferences made from the analysis may not be statistically significant.
Time frame: Data collection was limited to a period of 31 days, which could greatly reduce the chance of finding significant insights.
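The 31-day figure can be checked from the earliest and latest dates visible in the outputs above (4/12/2016 and 5/12/2016). The sketch assumes these two dates are the actual ends of the collection window; first_day, last_day and n_days are hypothetical names.

```r
# Earliest and latest dates seen in the printed data (assumed window ends)
first_day <- as.Date("4/12/2016", format = "%m/%d/%Y")
last_day  <- as.Date("5/12/2016", format = "%m/%d/%Y")

# Number of calendar days in the window, counting both end dates
n_days <- as.integer(last_day - first_day) + 1

print(n_days)
```

On the real data, range(daily_activity$date) would supply the two end dates directly.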