This case study is part of the Google Data Analytics Capstone Project and is my very first data analytics project. It has been an intensive learning process, and this project serves as a culmination of everything I have learnt throughout the program.
Bellabeat is a high-tech manufacturer of health-focused smart products designed for women. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market.
Bellabeat has several wellness-tracker products.
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Ivy: A wellness tracker designed as an elegant bracelet that monitors users’ heart rate, respiratory rate, and physical and mental activity. The data is combined and displayed as wellness and readiness scores to help users keep track of their daily habits.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has been making several marketing efforts, listed as follows.
The cofounder and Chief Creative Officer of Bellabeat believes that analyzing smart device usage data from non-Bellabeat products will provide insights applicable to one of Bellabeat’s products.
This report will attempt to answer these business questions:
The data analysis will focus on the following tasks:
1. Identifying users’ habits in tracking daily activities and recording sleep.
2. Determining metrics to identify users’ activity levels: intensity value vs. total steps.
3. Identifying trends in users’ activity levels.
4. Identifying trends in hourly activity across different user types.
5. Identifying trends in daily activity on days of the week across different user types.
The dataset provided for this case study is the FitBit Fitness Tracker Data, made available by Möbius on Kaggle. The dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 2016-12-03 and 2016-12-05. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between outputs represents the use of different types of Fitbit trackers and individual tracking behaviors and preferences.
This dataset is listed as CC0: Public Domain. It is available for use publicly without requiring any permission.
The dataset contains eighteen tables, in either long or wide format depending on the table. It provides daily, hourly, and minute-level data for calories, intensities, and sleep.
Data credibility determines whether a data source is a good or a bad one. A good data source must be reliable, original, comprehensive, current, and cited. Metadata (data about the data) usually provides this level of information. Credibility issues can typically be addressed by collecting additional data, completing missing variables, and gathering data from different sources. The following table explains some issues with this dataset and how analysts could address them.
| Credibility Issues | Solution |
|---|---|
| Incomplete demographic data such as gender, weight, and age, which are some of the important variables determining activity behaviors | Complete the missing demographic data from sample users |
| Sample bias, no evidence that the data only came from female users | Confirm the gender of sample, remove sample that doesn’t comply with the population of consumers |
| Small sample size of 33 users and 31 days | Collect additional data from different users in a more extended time |
| The data is collected by another party and then made public | Second party data should be verified and confirmed by data vendor to ensure integrity and completeness. |
| The data is not current | Collect and use the most recent data |
Based on the issues stated above, the credibility of the data can be assessed as poor. However, obtaining data with good credibility takes effort and time, and it is the analyst’s judgement call whether the existing data is sufficiently credible to answer the business task. This limitation should be kept in mind when drawing conclusions from this analysis exercise.
Data integrity deals with the accuracy, completeness, consistency, and trustworthiness of data. It encompasses data quality, data security, and the ability of the data to answer the required business questions. To address data integrity, an analyst has an arsenal of tools to verify that the data is complete and free of errors such as null values or multiple data types in a column. Below is a comparison table describing the datasets used:
| Table Name | Description | Total Observations | Variables | Integrity Status |
|---|---|---|---|---|
| dailyActivity_merged | Daily activity of 33 users in 31 days | 940 | Number of steps, Distance, Active minutes, Calories | No duplicates, no nulls, barely sufficient data, aligned to objectives |
| sleepDay_merged | Daily sleep records of 24 users in 31 days | 410 | Sleep count, Total minutes of sleep, Total minutes of time in bed | Duplicates, no nulls, insufficient data, aligned to objectives |
| hourlySteps_merged | Hourly steps count of 33 users in 31 days | 22,099 | Steps total per hour per day | No duplicates, no nulls, barely sufficient data, aligned to objectives |
The data integrity of this dataset appears good, as most of the tables contain no duplicates, and any duplicates can be removed relatively easily with R code. The datasets are also compatible with each other, which allows merging and comparison across the tables. This enables a more comprehensive analysis relating users’ activity, intensities, and calories burned. A brief sketch of the kind of integrity check described above follows.
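For illustration, here is a minimal sketch of such a check, using readr and skimr (both loaded later in this report). It assumes the Kaggle CSV files are in the working directory; daily_activity_check is just a throwaway name for this spot check.
library(readr)    # read_csv()
library(skimr)    # skim_without_charts()
# Import one table and summarise completeness, column types, and value ranges
daily_activity_check <- read_csv("dailyActivity_merged.csv", show_col_types = FALSE)
skim_without_charts(daily_activity_check)
# Quick look at the structure to spot unexpected column types
str(daily_activity_check)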
From this section onward, the data is processed using R in RStudio. R is an exceptional tool for data exploration, cleaning, analysis, and visualization because its packages provide a wide variety of functions for these tasks. It can handle large amounts of data, record every data manipulation step in a script, and conveniently compile the results into a document (like this one).
The following packages contain useful functions to produce insightful trends out of the data.
- tidyverse: an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. It loads multiple packages at once, with ggplot2, dplyr, and tibble among the most popular.
- lubridate: makes it easier to work with dates and times.
- skimr: provides summary statistics about variables in data frames, tibbles, data tables, and vectors.
- janitor: has simple functions for examining and cleaning dirty data.
- ggalluvial: a ggplot2 extension for producing alluvial plots.
- ggpubr: provides easy-to-use functions for creating and customizing ggplot2-based plots.

install.packages("ggpubr", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(lubridate)
library(skimr)
library(janitor)
library(ggalluvial)
library(ggpubr)
Datasets are imported and assigned their respective variable names for easier manipulation. The datasets are:

- dailyActivity_merged.csv
- sleepDay_merged.csv
- hourlySteps_merged.csv
daily_activity_df <- read_csv("dailyActivity_merged.csv", show_col_types = FALSE)
sleep_day_df <- read_csv("sleepDay_merged.csv", show_col_types = FALSE)
hourly_steps_df <- read_csv("hourlySteps_merged.csv", show_col_types = FALSE)
head(daily_activity_df)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
head(sleep_day_df)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalT…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## # … with abbreviated variable name ¹TotalTimeInBed
head(hourly_steps_df)
## # A tibble: 6 × 3
## Id ActivityHour StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
According to generic naming conventions in R, it is good practice to use only lowercase letters, numbers, and underscores in column names. The function clean_names() does this particular job efficiently. Try comparing the column names before and after clean_names() is executed.

Before clean_names() is applied:

colnames(daily_activity_df)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(sleep_day_df)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(hourly_steps_df)
## [1] "Id" "ActivityHour" "StepTotal"
After clean_names() is applied:

daily_activity_df <- clean_names(daily_activity_df)
sleep_day_df <- clean_names(sleep_day_df)
hourly_steps_df <- clean_names(hourly_steps_df)
colnames(daily_activity_df)
## [1] "id" "activity_date"
## [3] "total_steps" "total_distance"
## [5] "tracker_distance" "logged_activities_distance"
## [7] "very_active_distance" "moderately_active_distance"
## [9] "light_active_distance" "sedentary_active_distance"
## [11] "very_active_minutes" "fairly_active_minutes"
## [13] "lightly_active_minutes" "sedentary_minutes"
## [15] "calories"
colnames(sleep_day_df)
## [1] "id" "sleep_day" "total_sleep_records"
## [4] "total_minutes_asleep" "total_time_in_bed"
colnames(hourly_steps_df)
## [1] "id" "activity_hour" "step_total"
As noted in the data overview, daily_activity_df has 33 distinct users and 31 days of activity data, while sleep_day_df has 24 users and 31 days. hourly_steps_df has 33 users and 736 distinct hourly timestamps.
n_distinct(daily_activity_df$id)
## [1] 33
n_distinct(daily_activity_df$activity_date)
## [1] 31
n_distinct(sleep_day_df$id)
## [1] 24
n_distinct(sleep_day_df$sleep_day)
## [1] 31
n_distinct(hourly_steps_df$id)
## [1] 33
n_distinct(hourly_steps_df$activity_hour)
## [1] 736
The first step in any data analysis workflow is to check for duplicates and nulls in the data. The function sum(duplicated()) counts duplicate rows, while colSums(is.na()) counts the null values in each column of the tables.
sum(duplicated(daily_activity_df))
## [1] 0
sum(duplicated(sleep_day_df))
## [1] 3
sum(duplicated(hourly_steps_df))
## [1] 0
colSums(is.na(daily_activity_df))
## id activity_date
## 0 0
## total_steps total_distance
## 0 0
## tracker_distance logged_activities_distance
## 0 0
## very_active_distance moderately_active_distance
## 0 0
## light_active_distance sedentary_active_distance
## 0 0
## very_active_minutes fairly_active_minutes
## 0 0
## lightly_active_minutes sedentary_minutes
## 0 0
## calories
## 0
colSums(is.na(sleep_day_df))
## id sleep_day total_sleep_records
## 0 0 0
## total_minutes_asleep total_time_in_bed
## 0 0
colSums(is.na(hourly_steps_df))
## id activity_hour step_total
## 0 0 0
As shown above, there are duplicate entries only in sleep_day_df and no null values in any of the 3 tables. The duplicates can be easily removed using the distinct() function. We can then verify that the duplicates have been removed.
sleep_day_df <- sleep_day_df %>% distinct()
sum(duplicated(sleep_day_df))
## [1] 0
Notice that in daily_activity_df, id is in a number format <dbl> and activity_date is in a character format <chr>. Converting these columns to more appropriate formats makes the analysis easier down the road. The id column acts as an identifier and can therefore be changed to <chr>, while the activity_date column can be changed to a date format to allow date manipulation and calculation. We will also rename activity_date to date to be concise.
head(daily_activity_df)
## # A tibble: 6 × 15
## id activ…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸ seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>, and abbreviated variable names
## # ¹activity_date, ²total_steps, ³total_distance, ⁴tracker_distance,
## # ⁵logged_activities_distance, ⁶very_active_distance,
## # ⁷moderately_active_distance, ⁸light_active_distance,
## # ⁹sedentary_active_distance
daily_activity_df <- daily_activity_df %>%
mutate(id = as.character(id)) %>%
rename(date = activity_date) %>%
mutate(date = as_date(date, format="%m/%d/%Y"))
head(daily_activity_df)
## # A tibble: 6 × 15
## id date total_…¹ total…² track…³ logge…⁴ very_…⁵ moder…⁶ light…⁷
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5 0 1.88 0.550 6.06
## 2 1503960366 2016-04-13 10735 6.97 6.97 0 1.57 0.690 4.71
## 3 1503960366 2016-04-14 10460 6.74 6.74 0 2.44 0.400 3.91
## 4 1503960366 2016-04-15 9762 6.28 6.28 0 2.14 1.26 2.83
## 5 1503960366 2016-04-16 12669 8.16 8.16 0 2.71 0.410 5.04
## 6 1503960366 2016-04-17 9705 6.48 6.48 0 3.19 0.780 2.51
## # … with 6 more variables: sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## # abbreviated variable names ¹total_steps, ²total_distance,
## # ³tracker_distance, ⁴logged_activities_distance, ⁵very_active_distance,
## # ⁶moderately_active_distance, ⁷light_active_distance
We will apply the same transformations to the other 2 dataframes.
sleep_day_df <- sleep_day_df %>%
mutate(id = as.character(id)) %>%
rename(date = sleep_day) %>%
mutate(date = as_date(date, format="%m/%d/%Y"))
head(sleep_day_df)
## # A tibble: 6 × 5
## id date total_sleep_records total_minutes_asleep total_time_in…¹
## <chr> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
## # … with abbreviated variable name ¹total_time_in_bed
hourly_steps_df <- hourly_steps_df %>%
mutate(id = as.character(id)) %>%
rename(date_time = activity_hour) %>%
mutate(date_time = as.POSIXct(date_time, format ="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))
head(hourly_steps_df)
## # A tibble: 6 × 3
## id date_time step_total
## <chr> <dttm> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373
## 2 1503960366 2016-04-12 01:00:00 160
## 3 1503960366 2016-04-12 02:00:00 151
## 4 1503960366 2016-04-12 03:00:00 0
## 5 1503960366 2016-04-12 04:00:00 0
## 6 1503960366 2016-04-12 05:00:00 0
The data is now ready for transformation and analysis to provide insights that answer the business questions. Several data analysis techniques, such as merging, grouping, filtering, and calculating new values, are used to distill insights from the data.
Users are grouped based on the total number of days they tracked their activities and recorded their sleep, out of the 31 days of data. The device usage frequency is categorized into:
- Heavy: more than 20 days of tracked activity
- Moderate: 11 to 20 days of tracked activity
- Light: 10 or fewer days of tracked activity
The sleep record frequency is categorized as:
- Often: more than 20 days of recorded sleep
- Sometimes: 11 to 20 days of recorded sleep
- Rarely: 1 to 10 days of recorded sleep
- Never: 0 days of recorded sleep
These user groups are then tagged to the respective id in new columns named usage_level and record_level.

We will start with daily_activity_df. A new dataframe active_tracked is created to capture the aggregated data and assign the respective usage_level based on the conditions stated above:
active_tracked <- daily_activity_df %>%
group_by(id) %>%
summarise(days_used = sum(n())) %>%
mutate(usage_level = case_when(
days_used <= 10 ~ "Light",
days_used <= 20 ~ "Moderate",
TRUE ~ "Heavy"
))
head(active_tracked)
## # A tibble: 6 × 3
## id days_used usage_level
## <chr> <int> <chr>
## 1 1503960366 31 Heavy
## 2 1624580081 31 Heavy
## 3 1644430081 30 Heavy
## 4 1844505072 31 Heavy
## 5 1927972279 31 Heavy
## 6 2022484408 31 Heavy
We will apply the same technique to the sleep_day_df.
sleep_log is the new dataframe assigned to capture the
aggregated data.
sleep_log <- sleep_day_df %>%
group_by(id) %>%
summarise(days_tracked = sum(n())) %>%
mutate(record_level = case_when(
days_tracked <= 10 ~ "Rarely",
days_tracked <= 20 ~ "Sometimes",
TRUE ~ "Often"
))
We will then combine these two dataframes into joint_act_sleep using id as the matching key. left_join() is used to combine the dataframes.
joint_act_sleep <- active_tracked %>%
left_join(sleep_log, by = "id")
head(joint_act_sleep)
## # A tibble: 6 × 5
## id days_used usage_level days_tracked record_level
## <chr> <int> <chr> <int> <chr>
## 1 1503960366 31 Heavy 25 Often
## 2 1624580081 31 Heavy NA <NA>
## 3 1644430081 30 Heavy 4 Rarely
## 4 1844505072 31 Heavy 3 Rarely
## 5 1927972279 31 Heavy 5 Rarely
## 6 2022484408 31 Heavy NA <NA>
From here, we see that there are null values in the combined
dataframe. This is because some users did not use the device to track
their sleep. We will assign 0 to days_tracked and “Never”
to record_level to modify the dataframe so that it is
easier to analyse.
joint_act_sleep <- joint_act_sleep %>%
mutate(days_tracked = replace(days_tracked, is.na(days_tracked), 0)) %>%
mutate(record_level = replace(record_level, is.na(record_level), "Never"))
head(joint_act_sleep)
## # A tibble: 6 × 5
## id days_used usage_level days_tracked record_level
## <chr> <int> <chr> <dbl> <chr>
## 1 1503960366 31 Heavy 25 Often
## 2 1624580081 31 Heavy 0 Never
## 3 1644430081 30 Heavy 4 Rarely
## 4 1844505072 31 Heavy 3 Rarely
## 5 1927972279 31 Heavy 5 Rarely
## 6 2022484408 31 Heavy 0 Never
Here, we can make a pie chart to see the distribution of usage level.
usage_level <- joint_act_sleep %>% group_by(usage_level) %>%
summarise(usage_sum = sum(n()))
usage_level <- usage_level %>% arrange(usage_sum)
hsize <- 2
usage_pie <- ggplot(usage_level, aes(x=hsize, y=usage_sum, fill=usage_level)) +
geom_col(color="black") +
geom_text(aes(label = usage_sum), position = position_stack(vjust=0.5)) +
coord_polar(theta = "y") +
guides(fill = guide_legend(title = "Usage Level")) +
  scale_fill_brewer(direction = -1) +   # legend labels come directly from usage_level values to avoid mislabeling
xlim(c(0.2, hsize + 0.5)) +
theme_void()
usage_pie + labs(title = "Distribution of Smart Device Usage Level"
)
We will also visualize how many users use their smart device to track their sleep.
sleep_pie_df <- joint_act_sleep %>% group_by(record_level) %>%
summarise(record_sum = sum(n()))
sleep_pie_df <- sleep_pie_df %>% arrange(record_sum)
sleep_pie_df
## # A tibble: 4 × 2
## record_level record_sum
## <chr> <int>
## 1 Sometimes 3
## 2 Never 9
## 3 Rarely 9
## 4 Often 12
sleep_pie <- ggplot(sleep_pie_df, aes(x=hsize, y=record_sum, fill=record_level)) +
geom_col(color="black") +
geom_text(aes(label = record_sum), position = position_stack(vjust=0.5)) +
coord_polar(theta = "y") +
guides(fill = guide_legend(title = "Record Level")) +
xlim(c(0.2, hsize + 0.5)) +
  scale_fill_brewer(direction = 1) +   # legend labels come directly from record_level values to avoid mislabeling
theme_void()
sleep_pie + labs(title="Usage of smart device for sleep tracking")
From the above charts, we can see that a very high proportion of users use their smart devices to track their daily activities, while the converse is true for sleep tracking. The reason could be that users tend to wear their smart devices during the day and leave the device to charge at night. Further correlation between sleep-tracking usage and battery level could be investigated to confirm this hypothesis.
Here, we will investigate users’ activity levels and see whether active people are more inclined to use a smart device to track their activities. Activity level is defined as the proportion of time a user spends in an active state over a day. A day consists of 1440 minutes.
Because the active minutes are counted in four different activity levels, they are converted into intensity values on a scale from 0 to 3:
- The total sedentary minutes are multiplied by 0
- The total lightly active minutes are multiplied by 1
- The total fairly active minutes are multiplied by 2
- The total very active minutes are multiplied by 3
A new data frame daily_intensity is created to aggregate the relevant data from daily_activity_df and calculate the active_percentage.
daily_intensity <- daily_activity_df %>%
select(id, date, total_steps,
sedentary = sedentary_minutes,
lightly = lightly_active_minutes,
fairly = fairly_active_minutes,
very_active = very_active_minutes,
) %>%
mutate(active_percentage = (sedentary*0 + lightly*1 + fairly*2 + very_active*3)/1440)
head(daily_intensity)
## # A tibble: 6 × 8
## id date total_steps sedentary lightly fairly very_active activ…¹
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 728 328 13 25 0.298
## 2 1503960366 2016-04-13 10735 776 217 19 21 0.221
## 3 1503960366 2016-04-14 10460 1218 181 11 30 0.203
## 4 1503960366 2016-04-15 9762 726 209 34 29 0.253
## 5 1503960366 2016-04-16 12669 773 221 10 36 0.242
## 6 1503960366 2016-04-17 9705 539 164 20 38 0.221
## # … with abbreviated variable name ¹active_percentage
Then we will select the usage_level column from the joint_act_sleep table and combine it with the daily_intensity table to create new dataframes.
intensity_df <- daily_intensity %>%
select(id, active_percentage, total_steps) %>%
left_join(joint_act_sleep, by="id") %>%
group_by(id)
intensity_df2 <- intensity_df %>% group_by(id) %>%
summarise(sum_total_steps = sum(total_steps), average_active_percentage = mean(active_percentage))
intensity_df3 <- joint_act_sleep %>% select(id, usage_level) %>%
left_join(intensity_df2, by="id")
head(intensity_df3)
## # A tibble: 6 × 4
## id usage_level sum_total_steps average_active_percentage
## <chr> <chr> <dbl> <dbl>
## 1 1503960366 Heavy 375619 0.260
## 2 1624580081 Heavy 178061 0.133
## 3 1644430081 Heavy 218489 0.174
## 4 1844505072 Heavy 79982 0.0822
## 5 1927972279 Heavy 28400 0.0306
## 6 2022484408 Heavy 352490 0.281
We will then plot total_steps against active_percentage to see whether the two are correlated.
scatter_plot <- ggplot(intensity_df, aes(x=total_steps, y=active_percentage, color=usage_level)) +
geom_point() +
geom_smooth(method=lm)
scatter_plot
## `geom_smooth()` using formula 'y ~ x'
From the scatterplot, we can observe a positive correlation between the total_steps and active_percentage variables. This means that higher active minutes are reflected in a higher step count recorded by the smart devices. We can therefore use total_steps as a proxy to classify the user base and see the distribution of user activeness.
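To put a rough number on this relationship, the Pearson correlation coefficient can be computed from the intensity_df dataframe built above. This is a supplementary check, not part of the original analysis, so no exact value is quoted here.
# Correlation between daily step counts and the intensity-based active percentage
cor(intensity_df$total_steps, intensity_df$active_percentage, use = "complete.obs")
# Optional: test whether the correlation is statistically significant
cor.test(intensity_df$total_steps, intensity_df$active_percentage)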
Are active people more likely to track their daily activities? Here, we will attempt to find the relationship between average steps per day and the number of users by categorizing users according to the categories below:
- Sedentary: 5,000 or fewer steps per day
- Lightly active: 5,001 to 7,500 steps per day
- Fairly active: 7,501 to 10,000 steps per day
- Very active: more than 10,000 steps per day
A new data frame daily_steps is created using the
aggregated data from daily_activity_df.
daily_steps <- daily_activity_df %>%
group_by(id) %>%
summarise(average_steps = mean(total_steps)) %>%
mutate(active_level=case_when(
average_steps <= 5000 ~ "Sedentary",
average_steps <= 7500 ~ "Lightly Active",
average_steps <= 10000 ~ "Fairly Active",
TRUE ~ "Very Active"
))
head(daily_steps)
## # A tibble: 6 × 3
## id average_steps active_level
## <chr> <dbl> <chr>
## 1 1503960366 12117. Very Active
## 2 1624580081 5744. Lightly Active
## 3 1644430081 7283. Lightly Active
## 4 1844505072 2580. Sedentary
## 5 1927972279 916. Sedentary
## 6 2022484408 11371. Very Active
user_level <- daily_steps %>% group_by(active_level) %>%
summarise(count = sum(n()))
step_pie = ggplot(user_level, aes(x=hsize, y=count, fill=active_level))+
geom_col(color="black")+
coord_polar(theta = "y") +
geom_text(aes(label = count),
position = position_stack(vjust = 0.5)) +
xlim(c(0.2, hsize + 0.5)) +
guides(fill = guide_legend(title = "Active level"))+
theme_void()
step_pie + labs(title="Distribution of smart devices user")
From the above pie chart, we observe that smart device users are distributed fairly evenly across all levels of activeness. Although the data points in this visualization are quite limited, if this distribution is assumed to be representative of the overall user base, there is a rather large market to tap into.
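To make the fairly even distribution concrete, the share of users in each activity level can be computed from the user_level dataframe created above (a small supplementary calculation, not part of the original code).
user_level %>%
  mutate(share = round(count / sum(count), 2)) %>%   # proportion of the 33 users in each level
  arrange(desc(share))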
In this section, we will dive into hourly_steps_df to explore whether there are any patterns in when users track their steps throughout the day.
The hourly_steps_df has a column named date_time. We will split it into separate date and time columns to make them easier to manipulate.
hourly_steps_1 <- hourly_steps_df %>%
separate(date_time, into = c("date", "time"), sep = " ") %>%
mutate(date = ymd(date)) %>%
mutate(time = str_sub(time, 1, 5))
hourly_steps_1
## # A tibble: 22,099 × 4
## id date time step_total
## <chr> <date> <chr> <dbl>
## 1 1503960366 2016-04-12 00:00 373
## 2 1503960366 2016-04-12 01:00 160
## 3 1503960366 2016-04-12 02:00 151
## 4 1503960366 2016-04-12 03:00 0
## 5 1503960366 2016-04-12 04:00 0
## 6 1503960366 2016-04-12 05:00 0
## 7 1503960366 2016-04-12 06:00 0
## 8 1503960366 2016-04-12 07:00 0
## 9 1503960366 2016-04-12 08:00 250
## 10 1503960366 2016-04-12 09:00 1864
## # … with 22,089 more rows
hourly_steps_1 <- hourly_steps_1 %>%
left_join(daily_steps, by="id")
head(hourly_steps_1)
## # A tibble: 6 × 6
## id date time step_total average_steps active_level
## <chr> <date> <chr> <dbl> <dbl> <chr>
## 1 1503960366 2016-04-12 00:00 373 12117. Very Active
## 2 1503960366 2016-04-12 01:00 160 12117. Very Active
## 3 1503960366 2016-04-12 02:00 151 12117. Very Active
## 4 1503960366 2016-04-12 03:00 0 12117. Very Active
## 5 1503960366 2016-04-12 04:00 0 12117. Very Active
## 6 1503960366 2016-04-12 05:00 0 12117. Very Active
Now, we are ready to create a line plot to see if we can find some patterns in users’ activity.
lg_data <- hourly_steps_1 %>%
group_by(active_level, time) %>%
summarise(hourly_steps=mean(step_total))
## `summarise()` has grouped output by 'active_level'. You can override using the
## `.groups` argument.
line_plot <- ggplot(lg_data, aes(x=time, y=hourly_steps, group=active_level, color=active_level)) +
geom_line(size=0.8) +
ggtitle("Time vs Average steps of Smart Device users")+
theme(axis.text.x = element_text(size = 12, angle = 90, vjust = 0.3),
axis.text.y = element_text(size = 14),
panel.border = element_rect(fill = "transparent"),
panel.background = element_blank(),
plot.title = element_text(size=17, hjust = .5),
legend.title = element_blank(),
legend.text = element_text(size=11))
line_plot + labs(x="Time of day", y="Hourly Steps") +
annotate("rect", xmin="08:00", xmax="11:00", ymin=-0, ymax=1300, alpha=0.2)+
annotate("rect", xmin="12:00", xmax="15:00", ymin=-0, ymax=1300, alpha=0.2)+
annotate("rect", xmin="17:00", xmax="20:00", ymin=-0, ymax=1300, alpha=0.2)+
scale_y_continuous(expand = c(0, 0), limits = c(0, 1300))+
annotate("text", x=10.5, y=1000, label="Morning", size=4)+
annotate("text", x=14.5, y=1150, label="Lunch", size=4)+
annotate("text", x=20, y=1200, label="After Work", size=4)
From the line plot above, we can observe a definite pattern: recorded steps increase significantly in the time periods when people are usually moving, namely the morning rush hour, lunchtime, and after-work hours.
In this section, we will explore how long users wear their smart devices each day. This is an important factor to explore if we want to improve Bellabeat products so that they can support users’ usage demands.
First, we will select all the minutes columns from daily_activity_df and sum them up. This provides a table with the total minutes of device wear for each day.
daily_use <- daily_activity_df %>%
select(id, date, very_active_minutes, lightly_active_minutes, fairly_active_minutes, sedentary_minutes) %>%
mutate(total_wear_minutes=very_active_minutes+lightly_active_minutes+fairly_active_minutes+sedentary_minutes)
head(daily_use)
## # A tibble: 6 × 7
## id date very_active_minutes lightly_ac…¹ fairl…² seden…³ total…⁴
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 25 328 13 728 1094
## 2 1503960366 2016-04-13 21 217 19 776 1033
## 3 1503960366 2016-04-14 30 181 11 1218 1440
## 4 1503960366 2016-04-15 29 209 34 726 998
## 5 1503960366 2016-04-16 36 221 10 773 1040
## 6 1503960366 2016-04-17 38 164 20 539 761
## # … with abbreviated variable names ¹lightly_active_minutes,
## # ²fairly_active_minutes, ³sedentary_minutes, ⁴total_wear_minutes
daily_use_lvl = daily_use %>%
select(id, date, total_wear_minutes) %>%
left_join(daily_steps) %>%
select(id, date, total_wear_minutes, active_level)
## Joining, by = "id"
head(daily_use_lvl)
## # A tibble: 6 × 4
## id date total_wear_minutes active_level
## <chr> <date> <dbl> <chr>
## 1 1503960366 2016-04-12 1094 Very Active
## 2 1503960366 2016-04-13 1033 Very Active
## 3 1503960366 2016-04-14 1440 Very Active
## 4 1503960366 2016-04-15 998 Very Active
## 5 1503960366 2016-04-16 1040 Very Active
## 6 1503960366 2016-04-17 761 Very Active
We then classify each daily record according to the categories defined in the code below:
daily_use_lvl<- daily_use_lvl %>%
mutate(daily_usage = case_when(
total_wear_minutes <= 720 ~ "Less than half day",
total_wear_minutes < 1440 ~ "More than half day",
total_wear_minutes == 1440 ~ "Full day"
))
daily_use_lvl
## # A tibble: 940 × 5
## id date total_wear_minutes active_level daily_usage
## <chr> <date> <dbl> <chr> <chr>
## 1 1503960366 2016-04-12 1094 Very Active More than half day
## 2 1503960366 2016-04-13 1033 Very Active More than half day
## 3 1503960366 2016-04-14 1440 Very Active Full day
## 4 1503960366 2016-04-15 998 Very Active More than half day
## 5 1503960366 2016-04-16 1040 Very Active More than half day
## 6 1503960366 2016-04-17 761 Very Active More than half day
## 7 1503960366 2016-04-18 1440 Very Active Full day
## 8 1503960366 2016-04-19 1120 Very Active More than half day
## 9 1503960366 2016-04-20 1063 Very Active More than half day
## 10 1503960366 2016-04-21 1076 Very Active More than half day
## # … with 930 more rows
daily_usage_pie_df <- daily_use_lvl %>%
group_by(daily_usage) %>%
summarise(count=sum(n()))
daily_usage_pie_df
## # A tibble: 3 × 2
## daily_usage count
## <chr> <int>
## 1 Full day 478
## 2 Less than half day 25
## 3 More than half day 437
We will use a pie chart to illustrate the distribution of daily device usage.
daily_usage_pie <- ggplot(daily_usage_pie_df, aes(x=hsize, y=count, fill=daily_usage)) +
geom_col(color="black") +
geom_text(aes(label = count), position = position_stack(vjust=0.5)) +
coord_polar(theta = "y") +
guides(fill = guide_legend(title = "Record Level")) +
xlim(c(0.2, hsize + 0.5)) +
scale_fill_brewer(direction = 1, labels = c("Full day", "Less than half day", "More than half day")) +
theme_void()
daily_usage_pie + labs(title="Distribution of daily usage")
From this pie chart, we see that about 97% of recorded days show users wearing their smart device for more than 12 hours. This is very encouraging, as it provides a big window of time to deliver targeted advertisements for Bellabeat products.
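As a quick sanity check on that figure (a supplementary calculation, not part of the original analysis), the proportions can be derived from daily_usage_pie_df; the “Full day” and “More than half day” categories together should account for roughly 97% of all recorded days.
# Share of each daily-usage category
daily_usage_pie_df %>%
  mutate(share = count / sum(count))
# Combined share of days with more than 12 hours of wear
sum(daily_usage_pie_df$count[daily_usage_pie_df$daily_usage != "Less than half day"]) /
  sum(daily_usage_pie_df$count)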
The key takeaways from this analysis are:
| Takeaways | Recommendation |
|---|---|
| Almost 88% of users wear their smart devices almost every day | We can emphasize advertising for Bellabeat’s Time product as an ideal smart device to track a user’s activity throughout the day. As the Time watch also connects to the Bellabeat app, targeted advertisements could be created to market our other products |
| The majority of users do not use their smart devices for sleep tracking | This could be a potential market to exploit. The Bellabeat Leaf could be a gamechanger in shifting the consumer mindset about wearing a smart device to sleep, as its bracelet form factor is not intrusive |
| Over 97% of daily records show the device being worn for more than half the day | We can tap into this large pool of users to provide more value-added products, as they are connected to their smart devices for the majority of their waking time |