Bellabeat is a high-tech manufacturer of health-focused products for women, co-founded by innovative artist Urška Sršen and mathematician Sando Mur. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.
Their main product, the Bellabeat app, provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions.
Bellabeat’s product line extends beyond the app to a range of smart wellness devices.
While Bellabeat provides their own line of smart wellness products, our analysis will primarily focus on unlocking new growth opportunities for the Bellabeat app (and products) by analyzing consumer data from non-Bellabeat smart devices (namely, the Fitbit).
At the end of this analysis, we will provide high-level recommendations for Bellabeat that will launch the company into the global smart device market.
We are assuming that the Fitbit users within the data source used for our analysis will be similar to Bellabeat users. Further, we are assuming that Fitbit and Bellabeat users represent average adults who use wellness applications to reach a better level of holistic health.
We are not expecting these users to be athletes, bodybuilders or individuals who require specialized wellness plans to achieve extreme levels of fitness.
We have been tasked with identifying trends in how consumers use non-Bellabeat smart devices and providing high-level recommendations for the Bellabeat app. The insights we discover will then help guide the marketing strategy for the company, with the goal of turning Bellabeat into a larger player in the global smart device market.
Let’s begin our analysis!
This part of our analysis is broken into three parts - Data Collection, Data Cleaning and Processing, and Exploratory Data Analysis (EDA). Each part provides in-depth information about, and an explanation of, our analysis process.
Please use the tabs to switch between the parts.
The data source used for this analysis can be found on the data science competition platform and online community site Kaggle, and was made available by the Kaggle account “Mobius”. The Kaggle citation indicates that the dataset was retrieved from the general-purpose open repository site Zenodo and was made available by Furberg, Robert; Brinton, Julia; Keating, Michael; Ortiz, Alexa.
Kaggle and Zenodo provide the following statement regarding how the data was gathered:
“These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.”
Data limitations and credibility were determined using the ROCCC guide (Reliable, Original, Comprehensive, Current, Cited).
Our recommendations on how to improve this data source can be found in the Appendices (7.2.1 Tables)
Both platforms are open source and provide the following licensing:
Kaggle
Zenodo
After our examination of the data source, we picked the following datasets to use for our analysis:
(Full overview of datasets in next section)
The data source is available in 18 CSV files that can be downloaded from Kaggle and Zenodo. We took the data source located on Kaggle for our analysis. The files were copied into a working directory folder to preserve the originals. We began examining the data within each file using Excel (along with Pivot tables); however, we switched to R after finding that some of the files contained too much data for Excel.
SQL Server was considered as a possible tool for examination and cleaning but was excluded due to the number of files. We decided R was the best tool given the various file sizes and the total number of files.
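The report does not show how the file sizes were compared; here is a minimal sketch (our own illustration, not from the original workflow) of one way to size up the files in R, assuming the 18 CSVs sit in the working directory:
# lists the CSVs and their sizes in MB, largest first
csv_files <- list.files(pattern = "\\.csv$")
round(sort(setNames(file.size(csv_files), csv_files), decreasing = TRUE) / 1e6, 1)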
| File Name | Original File Type | Number of Rows (inc header) | Descriptions |
|---|---|---|---|
| dailyActivity_merged | CSV File | 941 | Overview of daily activity including steps, distance, calories, intensities. 31 days of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. Long data format. |
| dailyCalories_merged | CSV File | 941 | Overview of daily calories tracked. 31 days of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. Long data format. |
| dailyIntensities_merged | CSV File | 941 | Overview of daily intensities (for active level and distance) measured in minutes and distance. 31 days of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. Long data format. Blanks noticed in ID and ActivityDate columns. |
| dailySteps_merged | CSV File | 941 | Overview of daily total steps. 31 days of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. Long data format. |
| heartrate_seconds_merged** | CSV File | 2483659 | Overview of participants’ heart rate per second. 961274 seconds of data for 14 participants. Participants identified by 10-digit ID. Activity logged by date. |
| hourlyCalories_merged | CSV File | 22100 | Overview of hourly calories measured by hour. 737 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date and hour. Long data format. Blanks noticed in ID and ActivityHour columns. |
| hourlyIntensities_merged | CSV File | 22100 | Overview of hourly total and average intensity measured by hour. 737 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date and hour. Long data format. Blanks noticed in ID and ActivityHour columns |
| hourlySteps_merged | CSV File | 22100 | Overview of total steps per hour measured by hour. 737 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date and hour. Long data format. Blanks noticed in ID and ActivityHour columns. |
| minuteCaloriesNarrow_merged** | CSV File | 1325581 | Overview of calories burned every minute (calorie data point per row) measured in calories. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. |
| minuteCaloriesWide_merged | CSV File | 21646 | Overview of calories burned every minute (calorie data point per column) measured in calories. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. |
| minuteIntensitiesNarrow_merged** | CSV File | 1325581 | Overview of intensity reached every minute (intensity count data point per row) measured in intensity count. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. |
| minuteIntensitiesWide_merged | CSV File | 21646 | Overview of intensity reached every minute (intensity data point per column) measured in intensity count. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. |
| minuteMETsNarrow_merged** | CSV File | 1325581 | Overview of metabolic equivalent (MET) every minute (MET data point per row). 44160 minutes of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. |
| minuteSleep_merged | CSV File | 188522 | Overview of minutes slept (minute data point per row). 49773 minutes of data for 24 participants. Participants identified by 10-digit ID. Activity logged by date. |
| minuteStepsNarrow_merged** | CSV File | 1325581 | Overview of steps taken every minute (step data point per row) measured in number of steps. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. |
| minuteStepsWide_merged | CSV File | 21646 | Overview of steps taken every minute (step data point per column) measured in number of steps. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. |
| sleepDay_merged | CSV File | 414 | Overview of total sleep measured in records, minutes asleep and total time in bed. 31 days of data for 24 participants. Participants identified by 10-digit ID. Activity logged by date. Long data format. Blanks noticed in ID and SleepDay columns. |
| weightLogInfo_merged | CSV File | 68 | Overview of weight logging information measured in kg and lbs. 31 days of data for 8 participants. BMI logged along with LogId. Purpose of the Fat and IsManualReport columns is unclear. Participants identified by 10-digit ID. Activity logged by date. Blanks noticed in ID and Date columns. |
Additional Notes: files marked with ** exceed Excel’s row limit (1,048,576 rows) and could only be examined in R.
Several packages are needed for our analysis, from reading the CSV files to creating custom colors. Please ensure that the below code chunk is run prior to running any other code; it will ensure that no further installation is needed afterwards.
# If needed, please switch your directory to your working dir
# setwd('path')
# Sets the CRAN mirror
options(repos = c(CRAN = "https://cran.r-project.org/"))
# Adds needed packages to a vector and assigns it to a variable to cut down on work
packages <- c("dplyr",
"tidyverse",
"ggplot2",
"skimr",
"readr",
"lubridate",
"janitor",
"ggpubr",
"ggrepel",
"scales",
"remotes")
# Checks for packages that aren't installed
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Loads all required packages at once
sapply(packages, library, character.only=TRUE)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.3 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'janitor'
##
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
##
##
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:readr':
##
## col_factor
## $dplyr
## [1] "dplyr" "stats" "graphics" "grDevices" "utils" "datasets"
## [7] "methods" "base"
##
## $tidyverse
## [1] "lubridate" "forcats" "stringr" "purrr" "readr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "dplyr" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## $ggplot2
## [1] "lubridate" "forcats" "stringr" "purrr" "readr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "dplyr" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## $skimr
## [1] "skimr" "lubridate" "forcats" "stringr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "dplyr" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## $readr
## [1] "skimr" "lubridate" "forcats" "stringr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "dplyr" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## $lubridate
## [1] "skimr" "lubridate" "forcats" "stringr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "dplyr" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## $janitor
## [1] "janitor" "skimr" "lubridate" "forcats" "stringr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "dplyr"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
##
## $ggpubr
## [1] "ggpubr" "janitor" "skimr" "lubridate" "forcats" "stringr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "dplyr" "stats" "graphics" "grDevices" "utils" "datasets"
## [19] "methods" "base"
##
## $ggrepel
## [1] "ggrepel" "ggpubr" "janitor" "skimr" "lubridate" "forcats"
## [7] "stringr" "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [13] "tidyverse" "dplyr" "stats" "graphics" "grDevices" "utils"
## [19] "datasets" "methods" "base"
##
## $scales
## [1] "scales" "ggrepel" "ggpubr" "janitor" "skimr" "lubridate"
## [7] "forcats" "stringr" "purrr" "readr" "tidyr" "tibble"
## [13] "ggplot2" "tidyverse" "dplyr" "stats" "graphics" "grDevices"
## [19] "utils" "datasets" "methods" "base"
##
## $remotes
## [1] "remotes" "scales" "ggrepel" "ggpubr" "janitor" "skimr"
## [7] "lubridate" "forcats" "stringr" "purrr" "readr" "tidyr"
## [13] "tibble" "ggplot2" "tidyverse" "dplyr" "stats" "graphics"
## [19] "grDevices" "utils" "datasets" "methods" "base"
# This remote package is needed for correlation viz
# (For more info, https://rdrr.io/github/smin95/sesplot/man/sm_statCorr.html)
# remotes::install_github("smin95/sesplot")
library(smplot2)
## Updated tutorial for smplot: smin95.github.io/dataviz/
The data needed to be thoroughly examined prior to performing our analysis. Smaller files were looked at using Excel and Pivot tables; once we hit a file too large for Excel, we moved all files into R.
Using read_csv from the readr package, we
read the CSV files and stored them into dataframe variables:
# Assigns CSV files to dataframe variables for ease
daily_activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#daily_calories <- read_csv("dailyCalories_merged.csv")
#daily_intensities <- read_csv("dailyIntensities_merged.csv")
#daily_steps <- read_csv("dailySteps_merged.csv")
#heartrate_seconds <- read_csv("heartrate_seconds_merged.csv")
#hourly_calories <- read_csv("hourlyCalories_merged.csv")
#hourly_intensities <- read_csv("hourlyIntensities_merged.csv")
#minute_calories_long <- read_csv("minuteCaloriesNarrow_merged.csv")
#minute_calories_wide <- read_csv("minuteCaloriesWide_merged.csv")
#minute_intensities_long <- read_csv("minuteIntensitiesNarrow_merged.csv")
#minute_intensities_wide <- read_csv("minuteIntensitiesWide_merged.csv")
#minute_sleep <- read_csv("minuteSleep_merged.csv")
#minute_steps_long <- read_csv("minuteStepsNarrow_merged.csv")
#minute_steps_wide <- read_csv("minuteStepsWide_merged.csv")
#daily_sleep <- read_csv("sleepDay_merged.csv")
#weight_log_info <- read_csv("weightLogInfo_merged.csv")
We examined the larger files that could not be opened with Excel separately, using functions such as head and n_unique. n_unique was used to see how many unique identifiers were contained in the dataframes. Lastly, we manually verified the results from n_unique by selecting only distinct identifiers.
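The examination chunk that produced the output below is not echoed in the report; a minimal sketch, assuming the Kaggle file name, would be:
heartrate_seconds <- read_csv("heartrate_seconds_merged.csv") # too large for Excel
head(heartrate_seconds)          # quick look at the first rows
n_unique(heartrate_seconds$Id)   # skimr helper; 14 unique participants
n_unique(heartrate_seconds$Time) # 961274 unique timestamps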
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] 14
## [1] 961274
heartrate_seconds %>%
select(Time) %>%
distinct() %>%
  count() # verifies there are 961274 unique Times
minutes_mets_long <- read_csv("minuteMETsNarrow_merged.csv")
## Rows: 1325580 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityMinute
## dbl (2): Id, METs
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] 33
## [1] 44160
minutes_mets_long %>%
select(ActivityMinute) %>%
distinct() %>%
  count() # verifies there are 44160 unique minutes
Once we examined all of the files and picked the datasets that would work best for our analysis, we verified that the datasets truly meet the 30-participant sample size. We again used n_unique, which returned a result of 33 unique IDs for the daily_activity and hourly_steps dataframes. We manually confirmed this number by selecting all distinct IDs from the dataframes.
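The verification chunk is not echoed in the report; a minimal sketch of the calls, assuming the dataframes loaded above, would be:
n_unique(daily_activity$Id) # skimr helper; returns 33
n_unique(hourly_steps$Id)   # returns 33
daily_activity %>% select(Id) %>% distinct() %>% count() # manual confirmation of 33 IDs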
## [1] 33
## [1] 33
Moving forward, we conducted a check to see if our dataframes had any duplicates. We started by using the duplicated function to locate duplicates and nested it within a sum function for a count, which returned a result of 0.
We manually verified this result with a subset that scanned through our dataframe from beginning to end. If any duplicates were present, the rows would be stored within the subset for further examination and verification.
Our dataframes did not contain any duplicates. However, if needed, the subset code can be reused for other datasets in the future by changing the dataframe variable name.
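The counting chunk itself is not echoed in the report; a minimal sketch (our own illustration) of the nested duplicated/sum call described above:
sum(duplicated(daily_activity)) # counts fully duplicated rows; returns the 0 below
sum(duplicated(hourly_steps))   # same check for hourly_steps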
## [1] 0
# returns all duplicates (if any) from beginning to end of dataframe;
# if found, row stored in new subset that's assigned to duplicates object
duplicates <- daily_activity[duplicated(daily_activity) | duplicated(daily_activity,fromLast = TRUE), ]
head(duplicates) # displays duplicated rows (if any)
## [1] 0
duplicates <- hourly_steps[duplicated(hourly_steps) | duplicated(hourly_steps,fromLast = TRUE), ] # used for verification
head(duplicates) # displays duplicated rows (if any)
Next, we worked to remove any NA values. We first created a holder variable to preserve an original copy of our dataframe. The count of this copy was then checked so that we could use it for comparison. We then selected distinct values from our dataframes and dropped any NAs. We counted our dataframe and matched the resulting number against the count of our preservation copy.
We manually verified by creating a subset that applied the is.na function across all rows. If any rows returned TRUE, they were stored within the subset for further examination and verification. We can confirm that both dataframes did not contain any NAs.
preservation_daily_activity <- daily_activity # original record preservation
count(preservation_daily_activity) # returns previous count of rows (940) prior to removal
daily_activity <- preservation_daily_activity %>%
distinct() %>%
drop_na() # drops any missing values
count(daily_activity) # returns count of 940 - no missing values found
# returns any rows that contain NAs by applying is.na function across all rows in daily_activity:
# if found, stores them in new subset if marked as TRUE and assigned to missing_rows object
missing_rows <- daily_activity[apply(is.na(daily_activity), 1, any), ] # used for verification
head(missing_rows) # displays missing rows (if any)
preservation_hourly_steps <- hourly_steps # original record preservation
count(preservation_hourly_steps) # returns previous count of rows (22099) prior to removal
hourly_steps <- preservation_hourly_steps %>%
distinct() %>%
drop_na()
count(hourly_steps) # returns count of 22099 - no missing values found
missing_rows <- hourly_steps[apply(is.na(hourly_steps), 1, any), ] # used for verification
head(missing_rows) # displays missing rows (if any)
After removing duplicates and NAs, we moved on to standardizing the
column names in our dataframes. For this, we used the
clean_names function from the janitor package
and stored the results back into the dataframe.
If desired, we also added a small section that displays the columns of the preservation copies. This can be used for comparison or referencing.
# ensures columns are lowered, underscored, unique and contains standardized characters
daily_activity <- clean_names(daily_activity)
colnames(daily_activity) # displays cleaned column names
## [1] "id" "activity_date"
## [3] "total_steps" "total_distance"
## [5] "tracker_distance" "logged_activities_distance"
## [7] "very_active_distance" "moderately_active_distance"
## [9] "light_active_distance" "sedentary_active_distance"
## [11] "very_active_minutes" "fairly_active_minutes"
## [13] "lightly_active_minutes" "sedentary_minutes"
## [15] "calories"
## [1] "id" "activity_hour" "step_total"
The dates and times within our dataframes did not align to a standardized format of ‘YYYY-MM-DD HH:MM:SS AM/PM’. We used two slightly different methods, one for each dataframe. Since daily_activity only had a date column, we simply used as.Date from base R combined with mdy from lubridate.
hourly_steps required a different approach since it contained a date-time column. This dataframe had to be separated and then merged so that the date-time values could be formatted as ‘YYYY-MM-DD HH:MM:SS AM/PM’. We did this because using as.POSIXct prior to splitting and mutating the date-time values would produce NAs, due to the desired format being different from the original format (‘MM/DD/YYYY HH:MM:SS AM/PM’).
daily_activity <- daily_activity %>%
rename(date = activity_date) %>%
  mutate(date = as.Date(mdy(date))) # renames `activity_date` column to date and standardizes dates to 'YYYY-MM-DD'
colnames(daily_activity) # displays column names for confirmation of change
## [1] "id" "date"
## [3] "total_steps" "total_distance"
## [5] "tracker_distance" "logged_activities_distance"
## [7] "very_active_distance" "moderately_active_distance"
## [9] "light_active_distance" "sedentary_active_distance"
## [11] "very_active_minutes" "fairly_active_minutes"
## [13] "lightly_active_minutes" "sedentary_minutes"
## [15] "calories"
hourly_steps <- hourly_steps %>%
# separates `activity_hour` column into `date` and `time` columns and extra merges AM/PM back to `time`
separate(activity_hour, into = c("date", "time"), sep = " ", extra = "merge") %>%
mutate(date = as.Date(mdy(date))) %>% # changes `date` values to 'YYYY-MM-DD' format
unite(date_time, date, time, sep = " ") %>% # merges `date` and `time` columns into `date_time` column
# standardizes `date-time` values and sets timezone to UTC for accessibility
mutate(date_time = as.POSIXct(date_time, format = "%Y-%m-%d %I:%M:%S %p", tz = "UTC"))
head(hourly_steps) # displays new dttm object with desired formatting
The last part of our data cleaning and processing is “Anomaly Hunting”, or finding any outliers and peculiarities. Here is an overview of our findings with the use of skim_without_charts from the skimr package (the call is sketched below, followed by our findings):
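The skim chunk itself is not echoed in the report; a minimal sketch, assuming the cleaned dataframes above:
skim_without_charts(daily_activity) # prints the summary tables shown below, without inline histograms
skim_without_charts(hourly_steps)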
daily_activity anomalies:
- logged_activities_distance doesn’t include a lot of data; it will be removed from the dataframe and excluded from analysis
- sedentary_active_distance doesn’t include a lot of data but can be useful for analysis; it will be kept in the dataframe and included in analysis
hourly_steps anomalies:
- a large gap between the mean and median of step_total could imply that the dataframe has skewness
| Name | daily_activity |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2016-04-12 | 2016-05-12 | 2016-04-26 | 31 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| total_steps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| total_distance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| tracker_distance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| logged_activities_distance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| very_active_distance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| moderately_active_distance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| light_active_distance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| sedentary_active_distance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| very_active_minutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| fairly_active_minutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| lightly_active_minutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| sedentary_minutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
daily_activity <- select(daily_activity, -logged_activities_distance) # removes column `logged_activities_distance`
colnames(daily_activity) # confirms `logged_activities_distance` was removed
## [1] "id" "date"
## [3] "total_steps" "total_distance"
## [5] "tracker_distance" "very_active_distance"
## [7] "moderately_active_distance" "light_active_distance"
## [9] "sedentary_active_distance" "very_active_minutes"
## [11] "fairly_active_minutes" "lightly_active_minutes"
## [13] "sedentary_minutes" "calories"
| Name | hourly_steps |
| Number of rows | 22099 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.848235e+09 | 2.4225e+09 | 1503960366 | 2320127002 | 4445114986 | 6962181067 | 8877689391 |
| step_total | 0 | 1 | 3.201700e+02 | 6.9038e+02 | 0 | 0 | 40 | 357 | 10554 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_time | 0 | 1 | 2016-04-12 | 2016-05-12 15:00:00 | 2016-04-26 06:00:00 | 736 |
We then took a closer look at the potential skewness within hourly_steps. We first manually produced summary statistics to verify the results from skim_without_charts and then created a histogram for visual representation. However, please note that this visualization is not primary to our analysis - simply a curious side quest :D.
# CURIOUS SIDE QUEST: Checking skewness with histogram and summary statistics
hs_stat <- hourly_steps %>%
summarize(mean(step_total), # 320
median(step_total), # 40
min(step_total), # 0
max(step_total)) # 10554
print(hs_stat) # based on summary stats alone, `step_total` is right-skewed since the mean is greater than the median
## # A tibble: 1 × 4
## `mean(step_total)` `median(step_total)` `min(step_total)` `max(step_total)`
## <dbl> <dbl> <dbl> <dbl>
## 1 320. 40 0 10554
ggplot(hourly_steps, aes(x=step_total)) +
  geom_histogram(binwidth=1500, fill="blue", color="black") # verifies right-skewness using a histogram
After examining, verifying, cleaning and processing our data, we began our analysis. In this section, we display each step of our analysis along with any necessary explanation.
Prior to outlining our analysis, the below piece of code needs to be executed. It creates 1) a custom color matching the HEX color of Bellabeat’s logo (#e38872) and 2) a custom palette using a monochromatic scale based on that HEX. The color and palette are used for our visualizations. Our reasoning for creating these two is aesthetics and to show our appreciation to our client :)
# custom color taken from Bellabeat logo
bellabeat_color = "#e38872"
# custom monochromatic palette ("#D65232", "#DF775E", "#E89D8A", "#F1C2B6", "#FAE8E3")
bellabeat_palette <- colorRampPalette(c("#df4c29", "#fae9e5"))(5)
Now for the fun part! We first wanted a way to categorize the participants, as the dataset did not come with any demographic information such as age, gender, or nationality. We settled on creating activity levels as a categorization system (stored in a column called activity_level and elaborated on in the “Classifying Participants” section).
We based our category structure on the information provided by a MedicineNet article titled “How Many Steps a Day Is Considered Active?”
Our category structure is as follows:
| Activity Level | Requirement |
|---|---|
| Sedentary | Less than 5,000 steps per day |
| Barely Active | Between 5,000 and 7,499 steps per day |
| Somewhat Active | Between 7,500 and 9,999 steps per day |
| Active | Between 10,000 and 12,499 steps per day |
| Highly Active | More than 12,500 steps per day |
Before creating the categories and assigning them to participants, we calculated the average daily steps for each participant. We grouped daily_activity by ID, summarized each group by its mean, and stored the results in a new dataframe called daily_average_steps.
daily_average_steps <- daily_activity %>%
group_by(id) %>% # groups data by participant id
  # averages total steps, rounds to nearest whole number and stores in new column
summarize(average_steps = round(mean(total_steps)))
head(daily_average_steps) # displays average steps per id
With the daily average steps, we used case_when to apply our categories to participants and stored these classifications in a new column called activity_level. We also created a new dataframe called participant_activity_level that contains the new category values.
participant_activity_level <- daily_average_steps %>%
mutate(
# classifies average steps into categories and stores in new column
activity_level = case_when( average_steps < 5000 ~ "Sedentary",
average_steps >= 5000 & average_steps < 7500 ~ "Barely Active",
average_steps >= 7500 & average_steps < 10000 ~ "Somewhat Active",
average_steps >= 10000 & average_steps < 12500 ~ "Active",
average_steps >= 12500 ~ "Highly Active"
)
)
head(participant_activity_level) # displays average steps and active level per id
Now that we have categories attached to our participants, we wanted to see what percentage of our participants belonged to each category. We did this using several pipes to weave together summary calculations, column mutations, aggregation and data display.
activity_level_percentage <- participant_activity_level %>%
group_by(activity_level) %>% # groups participants by active level
summarize(total_participants = n()) %>% # calculates total number of participants
  # calculates percentage of participants in each active level and stores it in a new column
mutate(active_percentage = (total_participants / sum(total_participants))) %>%
mutate(percentage_labels = percent(active_percentage)) # changes calculated percentages into labeled percentages
# displays active level by percentage in descending order
activity_level_percentage %>%
arrange(desc(active_percentage)) %>%
  head()
To gear us up for our visualization, we re-factored the activity levels so that “Highly Active” is first and “Sedentary” is last. If desired, please uncomment the first two lines in the below code chunk to see the original factor levels.
To re-factor, we used the base R factor function and
applied it to our activity_level column:
# Stores old factor levels for preservation
# preservation_activity_level_percentage_factor_levels <- activity_level_percentage$activity_level
# print(preservation_activity_level_percentage_factor_levels) # displays old factor levels
activity_level_percentage$activity_level <- activity_level_percentage$activity_level %>%
factor(levels = c("Highly Active",
"Active",
"Somewhat Active",
"Barely Active",
"Sedentary"
)
) # reorders factor levels for visualization
print(activity_level_percentage$activity_level) # displays new factor levels
## [1] Active          Barely Active   Highly Active   Sedentary
## [5] Somewhat Active
## Levels: Highly Active Active Somewhat Active Barely Active Sedentary
With our average daily steps and percentages calculated, along with the re-factoring of our levels, we visualized our data using a pie chart. Within our code chunk, we provide step-by-step insight into each line for context:
# starts the plot of average active level percentages
ggplot(activity_level_percentage, aes(x = "", y = active_percentage, fill = activity_level)) +
  # first part needed for pie chart: a bar chart layer that uses the pre-calculated percentages
geom_bar(stat = "identity", width = 1) +
  # second part needed for pie chart, y-values used for slices and rotation starts at 3 o'clock,
# using 0 instead of pi/2 for aesthetic reasons
coord_polar(theta = "y", start = 0) +
  theme_void() + # uses void theme for aesthetic reasons
  # centers title (subtitle) and sets font sizes to 14 (10)
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5, size = 10)) +
# applies custom colors to pie slices (Highly Active = Darkest Tint and so forth)
scale_fill_manual(values = bellabeat_palette) +
geom_text(aes(label = percentage_labels),
position = position_stack(vjust = 0.5)) + # adds percentage labels on top of pie slices and centers them
labs(x = NULL,
y = NULL, # removes x- and y-axes titles
fill = "Activity Level", # changes legend title
title = "Participant Distribution",# adds chart title
subtitle = "% represents proportion participants that fall into each activity level") # adds chart subtitleOnce we observed the proportion of participants that fell within each activity level, we explored distribution using a histogram. With our histogram, we wanted to see where participants typically fell based on their activity level.
For our visualization, we created two variables for specific purposes:
- mean_value, which supplies the mean for plotting the xintercept line
- bin_width, calculated with the Freedman-Diaconis rule since daily_average_steps has a small number of observations but large average step values
mean_value <- mean(daily_average_steps$average_steps) # gives mean value for plotting
bin_width <- 2 * (IQR(daily_average_steps$average_steps) / nrow(daily_average_steps)^(1/3)) # Freedman-Diaconis rule
With these two variables, we completed the below code chunk for our visualization:
ggplot(daily_average_steps, aes(x = average_steps)) +
# creates histogram and bins
geom_histogram(binwidth = bin_width, fill = bellabeat_color, color = bellabeat_palette[5]) +
geom_vline(aes(xintercept = mean_value),
linetype = "dashed",
linewidth = 1) + # plots line at mean value
theme_light() + # desired viz theme
theme(plot.title = element_text(hjust = 0.5, size = 14)) + # centers and enlarges chart title
labs(x = "Average Steps per Participant",
y = "Number of Participants",
title="Histogram of Average Daily Steps") + # adds text to x-,y-axis and chart title
annotate("text",
x = mean_value,
y = 0,
label = paste0("Mean = ", round(mean_value)),
fontface = "bold",
vjust = -40,
           hjust = -0.4) # adds, bolds, and positions the Mean label near the top of the chart and slightly to the right
Key Insight(s)
At this point, we obtained some important insights from our summary statistics, aggregations and/or visualizations:
- Most participants fall within Sedentary (24.2%), Barely Active (27.3%) and Somewhat Active (27.3%), walking no more than 10,000 steps a day. With an ideal step goal of 7,500, many participants are only just reaching it or falling short.
- Distribution of average steps appears to be normal, with a slight tail to the right.
To gain some further insights beyond our pie chart and histogram, we pivoted slightly and explored which days and hours participants were most active. We again used participants’ average daily steps, along with their average distance, to plot two bar charts.
We started first with analyzing active weekdays. Prior to creating our visualizations, we needed to do the following:
- create a new dataframe called weekday_steps_distance that will contain modified daily_activity values
- create a new column (weekday) to store the weekday for each observation within daily_activity
- re-factor weekday so that the levels for weekday_steps_distance follow the standard 7-day week
- group weekday_steps_distance by each weekday and summarize the average daily steps and distance for each day
# adds new column to daily activity data that only includes weekday
weekday_steps_distance <- daily_activity %>%
mutate(weekday = weekdays(date))
# sets levels: Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday
weekday_steps_distance$weekday <- weekday_steps_distance$weekday %>%
ordered(levels = c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday"))
# calculates average steps, distance and groups by weekday
weekday_steps_distance <- weekday_steps_distance %>%
group_by(weekday) %>%
summarize(weekday_average_steps = mean(total_steps),
            weekday_average_distance = mean(total_distance))
With these steps completed, we were able to complete our charts and
arranged them side-by-side using ggarrange:
ggarrange( # arranges both bar charts into one plot
ggplot(weekday_steps_distance, aes(weekday, weekday_average_steps)) + # charts for average steps
# conditional used for coloring only bars that meet ideal levels
geom_col(stat = "identity", aes(fill = ifelse(weekday_average_steps > 7500, "Ideal", "Below Ideal"))) +
geom_hline(aes(yintercept = 7500),
               linewidth = 1) + # adds a line at the recommended steps/distance level
theme_light() + # desired viz theme
theme(axis.text.x = element_text(angle = 45, vjust = 0.5), # tilts x-axis text
plot.title = element_text(hjust = 0.5, size = 14), # centers and enlarges chart title
plot.subtitle = element_text(hjust = 0.5, size = 10), # centers and shrinks chart subtitle
legend.position = "none") + # removes legend
scale_fill_manual(values = c("Ideal" = bellabeat_palette[1], "Below Ideal" = bellabeat_palette[5])) + # supplies fill color for conditional
labs(x = NULL,
y = NULL,
title = "Average Steps per Weekday",
subtitle = "Recommended daily steps is 7500"), # adds labels
ggplot(weekday_steps_distance, aes(weekday, weekday_average_distance)) + # charts for average distance
# conditional used for coloring only bars that meet ideal levels
geom_col(stat = "identity", aes(fill = ifelse(weekday_average_distance > 5, "Ideal", "Below Ideal"))) +
geom_hline(aes(yintercept = 5),
linewidth = 1) +
theme_light() + # desired viz theme
theme(axis.text.x = element_text(angle = 45, vjust = 0.5),
plot.title = element_text(hjust = 0.5, size = 14),
plot.subtitle = element_text(hjust = 0.5, size = 10),
legend.position = "none") +
scale_fill_manual(values = c("Ideal" = bellabeat_palette[1], "Below Ideal" = bellabeat_palette[5])) +
labs(x = NULL,
y = NULL,
title = "Average Distance per Weekday",
subtitle = "Recommended daily distance is 5 miles")
)
Next, we started analyzing active hours. Similar to the weekdays, we needed to do some prep work prior to creating our visualization:
- separate date_time in hourly_steps into two columns, date and time, derived from date_time
- group the hourly_steps data by each hour and aggregate the average steps
hourly_steps <- hourly_steps %>%
mutate(date = as.Date(date_time),
time = format(date_time, "%H")) # separates time from date_time column for aggregating and plotting
hours <- hourly_steps %>%
group_by(time) %>%
  summarize(average_hour_steps = mean(step_total)) # aggregates average steps by hour
With the prep work completed, we created our visualization:
ggplot(hours, aes(time, average_hour_steps)) + # plots average steps and time
geom_col(stat = "identity", aes(fill = average_hour_steps)) + # fills using average steps
theme_light() + # desired viz theme
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + # centers and enlarges chart title
# colors plot with custom palette and green color for contrast
scale_fill_gradient(low = "#0dfbb2", high = bellabeat_palette[1]) +
labs(x = "Hours",
y = "Average Steps",
fill = "Average Hourly Steps", # changes legend title
title = "Average Steps per Hour") # chart labels## Warning in geom_col(stat = "identity", aes(fill = average_hour_steps)):
## Ignoring unknown parameters: `stat`
Key Insight(s)
At this point, we obtained some important insights from our summary statistics, aggregations and/or visualizations:
- Participants were most active on Mondays, Tuesdays, Wednesdays, and Saturdays based on steps, and achieved the ideal distance on those days. Sunday was the least active day.
- Users are mostly active between the hours of 8 AM and 7 PM, with most walking occurring 12-2 PM and 5-7 PM (most likely around lunch and dinner)
Pivoting again, we wanted to see if there was any correlation between participants’ daily total steps and the calories they logged. We used a visualization to explore this and confirmed our r- and p-values using cor.test.
sm_statCorr comes from a package that had to be installed from a remote GitHub repository since it is not available on CRAN at the moment. For more information about this package, please visit https://rdrr.io/github/smin95/sesplot/man/sm_statCorr.html.
We chose the Pearson correlation method because we are using two continuous variables and want to see the linear relationship of these two variables.
ggplot(daily_activity, aes(total_steps, calories)) + # plots correlation chart
geom_jitter(stat = "identity") + # ensures points are not overlaying
sm_statCorr(color = bellabeat_palette[1],
corr_method = "pearson") + # adds corr coefficient, stat significance and corr line
theme_light() + # desired viz theme
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + # centers and enlarges chart title
labs(x = "Total Steps",
y = "Calories",
title = "Total Steps vs Calories") # chart labels## `geom_smooth()` using formula = 'y ~ x'
# verifies R- and p- values along with other statistical info
cor.test(daily_activity$total_steps, daily_activity$calories, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: daily_activity$total_steps and daily_activity$calories
## t = 22.472, df = 938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5483688 0.6316184
## sample estimates:
## cor
## 0.5915681
Key Insight(s)
At this point, we obtained some important insights from our summary statistics, aggregations and/or visualizations:
- We are seeing a correlation coefficient of 0.59 for daily steps vs calories, indicating a moderate positive relationship
After analyzing participants’ activity levels, we first explored how frequently participants used their devices during the 31-day survey. We then explored how long participants wore their devices on a daily basis.
To begin our analysis of participants’ daily usage of their devices, we created a separate set of categories based on how many days participants logged their activity. The categories are as follows:
| Usage Level | Requirement |
|---|---|
| No Usage | 0 days logged |
| Minimal Usage | 1 - 9 days logged |
| Moderate Usage | 10 - 19 days logged |
| Frequent Usage | 20+ days logged |
Now that we have created the categories, we grouped daily_activity by ID, summarized each group by the number of rows for each ID, assigned a usage level based on this count, and stored the results in a new dataframe called daily_usage.
daily_usage <- daily_activity %>%
group_by(id) %>% # groups rows by id
summarize(num_days_logged = sum(n())) %>% # sums each participant's row
# determines participants usage level based on summation
mutate(usage_level = case_when( num_days_logged == 0 ~ "No Usage",
num_days_logged >= 1 & num_days_logged < 10 ~ "Minimal Usage",
num_days_logged >= 10 & num_days_logged < 20 ~ "Moderate Usage",
                                  num_days_logged >= 20 ~ "Frequent Usage"))
Now that we have assigned usage levels to our participants, we moved on to creating a visualization to understand their usage trends further. Prior to creating the visualization, we needed to complete some prep work:
- group by usage_level, count how many participants fell into each level, and then arrange the levels in ascending order
- re-factor usage_level so that it matches our category system outlined above
participants_usage_level <- daily_usage %>%
group_by(usage_level) %>% # groups by usage level
summarize(num_participants = sum(n())) %>% # counts the number of participants in each usage level
arrange(num_participants) # sorts by number of participants
head(participants_usage_level)
participants_usage_level$usage_level <- participants_usage_level$usage_level %>%
ordered(levels = c("No Usage",
"Minimal Usage",
"Moderate Usage",
"Frequent Usage")) # re-orders the usage levels for plot vizWith this prep done, we completed our visualization and provided comments for context:
ggplot(participants_usage_level, aes(x=usage_level, y=num_participants)) + # begins plot
geom_bar(stat="identity", color = bellabeat_palette[1], fill = bellabeat_palette[3]) + # creates, colors and fills plot
geom_text(aes(label = num_participants),
vjust = -.5,
fontface = "bold",
color = bellabeat_palette[1]) + #adds, adjust, bolden, and colors number of participants labels above bars
theme_light() + # desired viz theme
theme(plot.title = element_text(hjust = 0.5, size = 14)) + # centers and enlarges chart title
labs(x = "Usage Level",
y = "Number of Participants",
title = "Distribution of Usage Levels") # chart labelsWe explored how frequently participants logged their activity and now
want to explore how often participants wore their device throughout the
day. In order to do this, we needed to merge daily_activity
and daily_usage to have a comprehensive dataframe.
Next, we examined participants’ wear levels by percentages, which were obtained through several calculations. The results of these calculations were then used to create device wear category levels. We summed the active and sedentary minutes in daily_activity and stored this total in a new column called total_minutes_worn.
Wear levels are as follows:
| Wear Level | Requirement |
|---|---|
| Barely Worn | device worn < 25% of the day |
| Somewhat Worn | device worn 25 - 49% of the day |
| Mostly Worn | device worn 50 - 74% of the day |
| Frequently Worn | device worn 75 - 99% of the day |
| Constantly Worn | device worn 100% of the day |
daily_activity_usage <- merge(daily_activity, daily_usage, by = c("id")) # merges daily activity and daily usage together
head(daily_activity_usage)
minutes_device_worn <- daily_activity_usage %>%
mutate(total_minutes_worn = very_active_minutes + fairly_active_minutes + lightly_active_minutes + sedentary_minutes) %>%
mutate(device_worn_percentage = round((total_minutes_worn / 1440) * 100)) %>%
mutate(device_worn_level = case_when(device_worn_percentage < 25 ~ "Barely Worn (< 25%)",
device_worn_percentage >= 25 & device_worn_percentage < 50 ~ "Somewhat Worn (25-50%)",
device_worn_percentage >= 50 & device_worn_percentage < 75 ~ "Mostly Worn (50-75%)",
device_worn_percentage >= 75 & device_worn_percentage < 100 ~ "Frequently Worn (75-99%)",
device_worn_percentage == 100 ~ "Constantly Worn (100%)"))
head(minutes_device_worn)
Next, we performed several calculations with the use of pipes to obtain percentages representing how often participants of all activity levels wore their devices throughout the day. We also created labels to represent these percentages for plotting.
Please note: we did consider analyzing device wear for each activity level (such as “Highly Active”, “Barely Active”, etc.) but this seemed like overkill. We decided to focus on all participants regardless of activity level.
minutes_device_worn_all_usage_percentage <- minutes_device_worn %>%
group_by(device_worn_level) %>% # groups by device wear level
summarize(minutes = n()) %>% # counts the number of rows
mutate(total_minutes = sum(minutes)) %>% # sums the rows
group_by(device_worn_level) %>% # groups again by device wear level
summarize(total_percentage = minutes / total_minutes) %>% # obtains percentage of minutes
  mutate(percentage_labels = scales::percent(total_percentage)) # creates labels for plotting
We re-factored the levels for device_worn_level to
represent our categories:
minutes_device_worn_all_usage_percentage$device_worn_level <- minutes_device_worn_all_usage_percentage$device_worn_level %>%
factor(levels = c("Barely Worn (< 25%)",
"Somewhat Worn (25-50%)",
"Mostly Worn (50-75%)",
"Frequently Worn (75-99%)",
"Constantly Worn (100%)"))This code chuck was created because percentage labels for ‘Barely Worn’ and ‘Somewhat Worn’ were overlapping. To remedy this, we first calculated the y position for the labels by obtaining the cumulative sum. Then the final positions of the labels were determined by calculating a midpoint from each slice of our pie chart. These two ensures that our labels do not overlap.
# determines position of labels
minutes_device_worn_all_usage_percentage$cumsum <- cumsum(minutes_device_worn_all_usage_percentage$total_percentage)
# determines y position of the labels
minutes_device_worn_all_usage_percentage$midpoint <- minutes_device_worn_all_usage_percentage$cumsum - (minutes_device_worn_all_usage_percentage$total_percentage / 2)
Once we had everything we needed, we plotted our final visualization:
ggplot(minutes_device_worn_all_usage_percentage, aes(x = "", y = total_percentage, fill = device_worn_level)) +
# first part needed for pie chart, bar chart layer and ensuring that pre-calculated percentages are used
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") + # second part needed for pie chart, y-values used for slices
theme_void() + # uses void theme for aesthetic reasons
theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5, size = 10)) + # centers title (subtitle) and sets font sizes to 14 (10)
# reverses and applies custom colors to pie slices (Constantly Worn = Darkest Tint and so forth)
scale_fill_manual(values = rev(bellabeat_palette)) +
geom_label_repel(aes(y = midpoint,label = percentage_labels), # handles overlapping percentage labels
                   segment.color = NA, # removes label lines
show.legend = FALSE) + # removes repel's legend (thus, removing the random "a" on legend)
labs(x = NULL,
y = NULL, # removes x- and y-axes titles
fill = "Device Worn", # changes legend title
title = "Time Worn per Day",# adds chart title
subtitle = "% represents how often participants wore their device throughout the day") # adds chart subtitleKey Insight(s)
At this point, we obtained some important insights from our summary statistics, aggregations and/or visualizations:
- 30 of the 33 participants used their devices for most of the 31 days
- 51% of participants wore their device constantly (100% of the day) and 35% of participants wore it for most of the day (50-75%)
We are nearing the end of our analysis!
From our analysis, we compiled an in-depth overview of our key insights along with the questions we looked to answer. From these questions, we generated recommendations to not only answer the questions but also address our main task.
Our analysis provided us with 7 key insights, which we evaluated to produce 4 recommendations for Bellabeat’s marketing team. The best solution is detailed piece by piece below.
As a recap, our objective for this analysis was to identify trends in how consumers use non-Bellabeat smart devices and provide high-level recommendations for the Bellabeat app. The business task for Bellabeat is to unlock new growth opportunities for the Bellabeat app (and products) and launch the company into the global smart device market.
We identified several trends within our analysis, and the best solution is to implement n-Day, Week-Long and Month-Long Step Challenges with Tiered Rewards as an incentive. Our suggested breakdown of this incentive system is as follows:
n-day Step Challenge Information
Week-Long Step Challenge Information
Month-Long Step Challenge Information
Our best solution was reached by considering the key insights found during our analysis and the questions that arose from those insights. The below sections are the questions we considered along with the insights that provoked them:
(Why We Ask)
Of the 33 participants, we found that 30 logged their information for most of the 31-day trial. Of these, 51% wore their device constantly, or all day. This gives us a good indication that users are willing and able to consistently use fitness-related devices and applications.
While this is a great start, we do not want to take it for granted, as only 51% of the participants wore their devices all day. 35% of participants wore their devices for only most of the day (anywhere between 50% and 75% of the day), which leaves an open market of participants who could be converted to all-day users.
(Why We Ask)
Now, let’s turn to the participants’ activity during the 31-day trial. We see a moderate positive correlation between the number of daily steps a participant took and the calories they logged. This could indicate that the more a participant walked, the more calories they burned.
Unfortunately, because the data source did not include demographic information, such as gender, this can only be an assumption. What isn’t an assumption is the days and hours that participants were most active. Based on participants’ logged steps, we see that they were most active on Mondays, Tuesdays, Wednesdays and Saturdays, between the hours of 8 AM and 7 PM, with the highest hourly steps occurring between 12 and 2 PM as well as 5 and 7 PM.
We can safely assume that walking was completed during lunch and dinner hours.
(Why We Ask)
If we dig a bit deeper into the participants’ activity, we are able to determine which activity level they mostly fall into based on their average daily steps. The top two levels were Barely Active (27.3%) and Somewhat Active (27.3%), meaning most participants logged a daily average of 5,000-10,000 steps, with the highest average within these two levels being 9,795 steps.
Outside of these two levels, 24.2% of participants fell into the Sedentary level, meaning they walked, on average, fewer than 5,000 steps. The highest average steps logged for this group was 4,797, just shy of the 5,000 mark.
(Why We Ask)
According to the CDC and NIH, “higher step counts lower mortality risks from all causes”. Based on a 2020 study, “compared with taking 4,000 steps per day, a number considered to be low for adults, taking 8,000 steps per day was associated with a 51% lower risk for all-cause mortality (or death from all causes).” The study also showed that with increased steps, the risk was lowered even more.
It stated, “taking 12,000 steps per day was associated with a 65% lower risk compared with taking 4,000 steps.” Based on our analysis, only 6.1% of participants took more than an average of 12,000 steps daily - just 2 participants out of 33. 11 of the 33 participants averaged between 8,000 and 12,000 steps per day.
Recommendations on how to improve the data source used for this analysis:
| Problem | Remedy |
|---|---|
| Unreliable | Use a larger sample size and notation of limitations/assumptions. A larger sample size can be obtained from already established social media engagement. |
| Unoriginal | Conduct a survey via social media networks as Bellabeat already has an established relationship with consumers. |
| Incomplete | Conduct a survey via social media networks as Bellabeat already has an established relationship with consumers. |
| Out-of-Date | Conduct a 6-month survey for current information. |
| Uncited | Use data collected from social media networks, Google search investment, advertisements, and customer engagement. |