This capstone case study is for the Google Data Analytics Professional Certificate. The scenario is as follows: You are a junior data analyst working on the marketing analyst team at Bellabeat. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide a marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your recommendations for Bellabeat’s marketing strategy.
Bellabeat is a high-tech wellness company that manufactures beautifully designed health-focused smart products for women. Even though Bellabeat is considered a small company, they have grown rapidly and have offices around the world.
The current advertising strategy consists of traditional methods but focuses on digital marketing. You may see or hear about Bellabeat on Facebook, Instagram, Twitter, YouTube, Google Search, Google Ads, radio stations, billboards, print, and television.
Bellabeat app: provides health related data for activity, sleep, menstrual cycle, and habits. The Bellabeat app connects to Bellabeat’s other smart wellness products.
Leaf: wellness tracker that can be worn as a bracelet, necklace, or clip
Time: wellness watch with a timeless classic look
Spring: smart water bottle that tracks daily water intake
Bellabeat membership: 24/7 personalized guidance on health & wellness based on lifestyle and goals.
Urška Sršen - Cofounder and
Chief Creative Officer Sando Mur - Mathematician, Cofounder, and key member of the executive team
Bellabeat marketing analytics team
Business Task: Discover growth opportunities to improve Bellabeat and their marketing strategy by analyzing smart device data.
Stakeholder Urška Sršen has asked for us to focus on one of Bellabeat’s products, analyze smart device usage data, find trends, and gain insights into how consumers use non-Bellabeat smart devices.
The Bellabeat product I’ve chosen for this case study is the Bellabeat app.
Stakeholder Urška Sršen encouraged the use of a specific public data set, FitBit Fitness Tracker Data. Sršen tells us that this data set may have some limitations and encourages us to add another data set to help address those limitations.
FitBit Fitness Tracker Data is an open-source data set made available by Mobius on Kaggle and was last updated two years ago. A CC0: Public Domain license link is located on Fitbit Fitness Tracker Data page and has been verified.
Surveys for Fitbit Fitness Tracker Data were generated and distributed by Amazon Mechanical Turk between March 12, 2016 - May 12, 2016 to thirty(30) eligible Fitbit users. The thirty(30) Fitbit users consented to submitting personal tracker data, including minute-level output for physical activity, heart rate, steps, and sleep monitoring.
Fitabase was used to collect and store survey data. Fitabase is a comprehensive data management platform designed to support innovative research projects using wearable and internet-connected devices. The Fitbit Fitness Tracker Data made available to us dates from April 12, 2016 - May 12, 2016.
FitBit Fitness Tracker Data is stored in 18 CSV Microsoft Excel files. To better understand these files and what they contain I downloaded them to my computer and opened them in Excel. The data format is long and structured. These are the files:
dailyActivity_merged.csv
dailyCalories_merged.csv
dailyIntensities_merged.csv
dailySteps_merged.csv
heartrate_seconds_merged.csv
hourlyCalories_merged.csv
hourlyIntensities_merged.csv
hourlySteps_merged.csv
minuteCaloriesNarrow_merged.csv
minuteCaloriesWide_merged.csv
minuteIntensitiesNarrow_merged.csv
minuteIntensitiesWide_merged.csv
minuteMETsNarrow_merged.csv
minuteSleep_merged.csv
minuteStepsNarrow_merged.csv
minuteStepsWide_merged.csv
sleepDay_merged.csv
weightLogInfo_merged.csv
These are the steps I took to sort and filter the data:
Made copies of all the CSV files in Excel.
Made each file into its own tab on one Excel spreadsheet called “preparing fitbit data sheets”.
These three data sets, dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged, can be removed because their data is in dailyActivity_merged.
These six data sets, minuteCaloriesNarrow_merged, minuteCaloriesWide_merged, minuteIntensitiesNarrow_merged, minuteIntensitiesWide_merged, minuteStepsNarrow_merged, minuteStepsWide_merged I have chosen not to use after reviewing their data because the important information that we need from them is already located in: hourlySteps_merged, hourlyCalories_merged, and hourlyIntensities_merged.
The data set minuteSleep_merged has a column called ‘value’ that contains the only useful information for this particular data set. Each cell indicates the sleep state for that minute of the participant. 1=asleep, 2=restless, 3=awake. (Information provided by Fitabase Data Dictionary page 17). While this is good information, it will not be necessary to use for this case study so it will be removed.
The data sets heartrate_seconds_merged and minuteMETsNarrow_merged will also be removed. While it is good that there is lots of data for these data sets, it won’t be necessary to use them for this analysis.
There are now six data sets to work with.
In the data set dailyActivity_merged:
Columns ‘TotalDistance’ and ‘TrackerDistance’ have the same exact data so I’ll remove the column ‘TrackerDistance’.
Columns ‘LoggedActivitiesDistance’ and ‘SedentaryActiveDistance’ have almost no data, so I’ll remove these columns. They will not impact this analysis.
Column ‘LogId’ in data set weightLogInfo_merged and column ‘logId’ in data set minuteSleep_merged refer to unique log Id numbers in Fitbit’s system for that sleep record. This information isn’t useful to us because we already have the participants Id numbers in another column that is consistent with all the data sets. So these two columns will be removed.
Column ‘Fat’ in data set weightLogInfo_merged has two entries which is not enough data to compare anything to, so this column will be removed.
Filtered all data sets from the smallest Id number to the largest Id number.
Saved each data set with “prepared” at the end of the file name. These are the six data sets I’ll work with:
dailyActivity_merged_prepared
hourlyCalories_merged_prepared
hourlyIntensities_merged_prepared
hourlySteps_merged_prepared
sleepDay_merged_prepared
weightLogInfo_merged_prepared
There are many limitations for this data.
The survey takes place over a short duration of time, two months, and the data collected spans over one month. A longer period of time for surveys and data collection would provide more useful and accurate results.
The data was created six years ago, it’s currently 2022. Even though Mobius updated the notebook on Kaggle two years ago, the data is still old.
It is unknown what the gender, age, and location is for the participants.
There seems to be data for 33 participants but only 30 were mentioned in the metadata. This could be considered sample bias. Having only 33 participants is an under-representation of all the fitbit users out there.
In the data there is evidence that suggests some participants take their fitbit fitness tracker off which can affect the data and results.
There are a few terms and abbreviations used in the data that are unclear. Fortunately, in the discussion section on the Fitbit Fitness Tracker Data notebook on Kaggle, Laimis Andrijauskas provided a link to the Fitabase Data Dictionary which clears up a lot of confusion.
In the Fitabase Data Dictionary there is an important note on page 15 for the file minuteMETsNarrow_merged that states, “All MET values exported from Fitabase are multiplied by 10. Please divide by 10 to get accurate MET values.” This would have affected my results and understanding of this data if I didn’t know about the Fitabase Data Dictionary and I were to have used the data.
The integrity of the Fitbit Fitness Tracker Data is questionable based on the above limitations.
Some data sets have large amounts of data, so I will switch from Excel to R because it can handle large data sets easily. These are the packages I will be using in R:
tidyverse
tidyr
dplyr
lubridate
janitor
skimr
readr
tibble
ggplot2
yaml
Loading packages
library(tidyverse)
library(tidyr)
library(dplyr)
library(ggplot2)
library(janitor)
library(lubridate)
library(readr)
library(skimr)
library(tibble)
library(yaml)
Loading the data sets, cleaning the file names, and cleaning the column names so we have consistent, easy to read information.
#Changed 'dailyActivity_merged_prepared.csv' to 'daily_activity'
daily_activity <- read_csv("C:/Users/jessi/OneDrive/Desktop/prepared data/dailyActivity_merged_prepared.csv") %>%
clean_names()
#Changed 'hourlyCalories_merged_prepared.csv' to 'hourly_calories'
hourly_calories <- read_csv("C:/Users/jessi/OneDrive/Desktop/prepared data/hourlyCalories_merged_prepared.csv") %>%
clean_names()
#Changed 'hourlyIntensities_merged_prepared.csv' to 'hourly_intensities'
hourly_intensities <- read_csv("C:/Users/jessi/OneDrive/Desktop/prepared data/hourlyIntensities_merged_prepared.csv") %>%
clean_names()
#Changed 'hourlySteps_merged_prepared.csv' to 'hourly_steps'
hourly_steps <- read_csv("C:/Users/jessi/OneDrive/Desktop/prepared data/hourlySteps_merged_prepared.csv") %>%
clean_names()
#Changed 'sleepDay_merged_prepared.csv' to 'daily_sleep'
daily_sleep <- read_csv("C:/Users/jessi/OneDrive/Desktop/prepared data/sleepDay_merged_prepared.csv") %>%
clean_names()
#Changed 'weightLogInfo_merged_prepared.csv' to 'weight_log'
weight_log <- read_csv("C:/Users/jessi/OneDrive/Desktop/prepared data/weightLogInfo_merged_prepared.csv") %>%
clean_names()
Summary of the data sets.
str(daily_activity)
## spec_tbl_df [940 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_date : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ total_steps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ total_distance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ very_active_distance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ moderately_active_distance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ light_active_distance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ very_active_minutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ fairly_active_minutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ lightly_active_minutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentary_minutes : num [1:940] 728 776 1218 726 773 ...
## $ calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(daily_sleep)
## spec_tbl_df [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ sleep_day : chr [1:413] "4/12/2016" "4/13/2016" "4/15/2016" "4/16/2016" ...
## $ total_sleep_records : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ total_minutes_asleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ total_time_in_bed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(hourly_calories)
## spec_tbl_df [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_hour: chr [1:22099] "4/12/2016 0:00" "4/12/2016 1:00" "4/12/2016 2:00" "4/12/2016 3:00" ...
## $ calories : num [1:22099] 81 61 59 47 48 48 48 47 68 141 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityHour = col_character(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(hourly_intensities)
## spec_tbl_df [22,099 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_hour : chr [1:22099] "4/12/2016 0:00" "4/12/2016 1:00" "4/12/2016 2:00" "4/12/2016 3:00" ...
## $ total_intensity : num [1:22099] 20 8 7 0 0 0 0 0 13 30 ...
## $ average_intensity: num [1:22099] 0.333 0.133 0.117 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityHour = col_character(),
## .. TotalIntensity = col_double(),
## .. AverageIntensity = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(hourly_steps)
## spec_tbl_df [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_hour: chr [1:22099] "4/12/2016 0:00" "4/12/2016 1:00" "4/12/2016 2:00" "4/12/2016 3:00" ...
## $ step_total : num [1:22099] 373 160 151 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityHour = col_character(),
## .. StepTotal = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(weight_log)
## spec_tbl_df [67 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ date : chr [1:67] "5/2/2016 23:59" "5/3/2016 23:59" "4/13/2016 1:08" "4/21/2016 23:59" ...
## $ weight_kg : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
## $ weight_pounds : num [1:67] 116 116 294 125 126 ...
## $ bmi : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
## $ is_manual_report: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. Date = col_character(),
## .. WeightKg = col_double(),
## .. WeightPounds = col_double(),
## .. BMI = col_double(),
## .. IsManualReport = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
Viewing column names.
colnames(daily_activity)
## [1] "id" "activity_date"
## [3] "total_steps" "total_distance"
## [5] "very_active_distance" "moderately_active_distance"
## [7] "light_active_distance" "very_active_minutes"
## [9] "fairly_active_minutes" "lightly_active_minutes"
## [11] "sedentary_minutes" "calories"
colnames(daily_sleep)
## [1] "id" "sleep_day" "total_sleep_records"
## [4] "total_minutes_asleep" "total_time_in_bed"
colnames(hourly_calories)
## [1] "id" "activity_hour" "calories"
colnames(hourly_intensities)
## [1] "id" "activity_hour" "total_intensity"
## [4] "average_intensity"
colnames(hourly_steps)
## [1] "id" "activity_hour" "step_total"
colnames(weight_log)
## [1] "id" "date" "weight_kg" "weight_pounds"
## [5] "bmi" "is_manual_report"
Counting how many unique participant Id’s there actually are in each data set.
n_unique(daily_activity$id)
## [1] 33
n_unique(daily_sleep$id)
## [1] 24
n_unique(hourly_calories$id)
## [1] 33
n_unique(hourly_intensities$id)
## [1] 33
n_unique(hourly_steps$id)
## [1] 33
n_unique(weight_log$id)
## [1] 8
Some data sets have more than 30 participants and some have less. This can cause our data to be skewed since the metadata claimed there were 30 participants.
Looking for any missing values.
skim_without_charts(daily_activity)
| Name | daily_activity |
| Number of rows | 940 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| activity_date | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| total_steps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| total_distance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| very_active_distance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| moderately_active_distance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| light_active_distance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| very_active_minutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| fairly_active_minutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| lightly_active_minutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| sedentary_minutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
skim_without_charts(daily_sleep)
| Name | daily_sleep |
| Number of rows | 413 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| sleep_day | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 5.000979e+09 | 2.06036e+09 | 1503960366 | 3977333714 | 4702921684 | 6962181067 | 8792009665 |
| total_sleep_records | 0 | 1 | 1.120000e+00 | 3.50000e-01 | 1 | 1 | 1 | 1 | 3 |
| total_minutes_asleep | 0 | 1 | 4.194700e+02 | 1.18340e+02 | 58 | 361 | 433 | 490 | 796 |
| total_time_in_bed | 0 | 1 | 4.586400e+02 | 1.27100e+02 | 61 | 403 | 463 | 526 | 961 |
skim_without_charts(hourly_calories)
| Name | hourly_calories |
| Number of rows | 22099 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| activity_hour | 0 | 1 | 13 | 15 | 0 | 736 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.848235e+09 | 2.4225e+09 | 1503960366 | 2320127002 | 4445114986 | 6962181067 | 8877689391 |
| calories | 0 | 1 | 9.739000e+01 | 6.0700e+01 | 42 | 63 | 83 | 108 | 948 |
skim_without_charts(hourly_intensities)
| Name | hourly_intensities |
| Number of rows | 22099 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| activity_hour | 0 | 1 | 13 | 15 | 0 | 736 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.848235e+09 | 2.4225e+09 | 1503960366 | 2320127002 | 4.445115e+09 | 6.962181e+09 | 8877689391 |
| total_intensity | 0 | 1 | 1.204000e+01 | 2.1130e+01 | 0 | 0 | 3.000000e+00 | 1.600000e+01 | 180 |
| average_intensity | 0 | 1 | 2.000000e-01 | 3.5000e-01 | 0 | 0 | 5.000000e-02 | 2.700000e-01 | 3 |
skim_without_charts(hourly_steps)
| Name | hourly_steps |
| Number of rows | 22099 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| activity_hour | 0 | 1 | 13 | 15 | 0 | 736 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.848235e+09 | 2.4225e+09 | 1503960366 | 2320127002 | 4445114986 | 6962181067 | 8877689391 |
| step_total | 0 | 1 | 3.201700e+02 | 6.9038e+02 | 0 | 0 | 40 | 357 | 10554 |
skim_without_charts(weight_log)
| Name | weight_log |
| Number of rows | 67 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| date | 0 | 1 | 13 | 15 | 0 | 56 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| is_manual_report | 0 | 1 | 0.61 | TRU: 41, FAL: 26 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 7.009282e+09 | 1.950322e+09 | 1.50396e+09 | 6.962181e+09 | 6.962181e+09 | 8.877689e+09 | 8.877689e+09 |
| weight_kg | 0 | 1 | 7.204000e+01 | 1.392000e+01 | 5.26000e+01 | 6.140000e+01 | 6.250000e+01 | 8.505000e+01 | 1.335000e+02 |
| weight_pounds | 0 | 1 | 1.588100e+02 | 3.070000e+01 | 1.15960e+02 | 1.353600e+02 | 1.377900e+02 | 1.875000e+02 | 2.943200e+02 |
| bmi | 0 | 1 | 2.519000e+01 | 3.070000e+00 | 2.14500e+01 | 2.396000e+01 | 2.439000e+01 | 2.556000e+01 | 4.754000e+01 |
The data tells us that there are no missing values.
Checking for duplicates.
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
sum(duplicated(hourly_calories))
## [1] 0
sum(duplicated(hourly_intensities))
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0
sum(duplicated(weight_log))
## [1] 0
There are 3 duplicates in ‘daily_sleep’ so those will need to be removed.
#Removing duplicates.
daily_sleep <- daily_sleep %>%
distinct()
#Checking that the duplicates were removed.
sum(duplicated(daily_sleep))
## [1] 0
All of the data sets have some kind of change that needs to happen with the column that contains the date or date and time. For 2 of the data sets I’m going to change the format of the date and change the column name to ‘date’. For 4 data sets I will change the format of the date and time, move date and time into their own columns called ‘date’ and ‘time’, and then remove the original column that contained both date and time.
daily_activity <- daily_activity %>%
rename(date = activity_date) %>%
mutate(date = as_date(date, format = "%m/%d/%Y"))
daily_sleep <- daily_sleep %>%
rename(date = sleep_day) %>%
mutate(date = as_date(date, format = "%m/%d/%Y"))
hourly_calories <- hourly_calories %>%
mutate(activity_hour = as.POSIXct(activity_hour, format = "%m/%d/%Y %H:%M" , tz=Sys.timezone())) %>%
separate(activity_hour, into = c('date', 'time'), sep = ' ', remove = TRUE)
hourly_intensities <- hourly_intensities %>%
mutate(activity_hour = as.POSIXct(activity_hour, format = "%m/%d/%Y %H:%M" , tz=Sys.timezone())) %>%
separate(activity_hour, into = c('date', 'time'), sep = ' ', remove = TRUE)
hourly_steps <- hourly_steps %>%
mutate(activity_hour = as.POSIXct(activity_hour, format = "%m/%d/%Y %H:%M" , tz=Sys.timezone())) %>%
separate(activity_hour, into = c('date', 'time'), sep = ' ', remove = TRUE)
weight_log <- weight_log %>%
mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M" , tz=Sys.timezone())) %>%
separate(date, into = c('date', 'time'), sep = ' ', remove = TRUE)
Double checking that the codes ran correctly
head(daily_activity)
## # A tibble: 6 × 12
## id date total_…¹ total…² very_…³ moder…⁴ light…⁵ very_…⁶ fairl…⁷
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 1.88 0.550 6.06 25 13
## 2 1503960366 2016-04-13 10735 6.97 1.57 0.690 4.71 21 19
## 3 1503960366 2016-04-14 10460 6.74 2.44 0.400 3.91 30 11
## 4 1503960366 2016-04-15 9762 6.28 2.14 1.26 2.83 29 34
## 5 1503960366 2016-04-16 12669 8.16 2.71 0.410 5.04 36 10
## 6 1503960366 2016-04-17 9705 6.48 3.19 0.780 2.51 38 20
## # … with 3 more variables: lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>, and abbreviated variable names
## # ¹total_steps, ²total_distance, ³very_active_distance,
## # ⁴moderately_active_distance, ⁵light_active_distance, ⁶very_active_minutes,
## # ⁷fairly_active_minutes
## # ℹ Use `colnames()` to see all variable names
head(daily_sleep)
## # A tibble: 6 × 5
## id date total_sleep_records total_minutes_asleep total_time_in…¹
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
## # … with abbreviated variable name ¹total_time_in_bed
head(hourly_calories)
## # A tibble: 6 × 4
## id date time calories
## <dbl> <chr> <chr> <dbl>
## 1 1503960366 2016-04-12 00:00:00 81
## 2 1503960366 2016-04-12 01:00:00 61
## 3 1503960366 2016-04-12 02:00:00 59
## 4 1503960366 2016-04-12 03:00:00 47
## 5 1503960366 2016-04-12 04:00:00 48
## 6 1503960366 2016-04-12 05:00:00 48
head(hourly_intensities)
## # A tibble: 6 × 5
## id date time total_intensity average_intensity
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1503960366 2016-04-12 00:00:00 20 0.333
## 2 1503960366 2016-04-12 01:00:00 8 0.133
## 3 1503960366 2016-04-12 02:00:00 7 0.117
## 4 1503960366 2016-04-12 03:00:00 0 0
## 5 1503960366 2016-04-12 04:00:00 0 0
## 6 1503960366 2016-04-12 05:00:00 0 0
head(hourly_steps)
## # A tibble: 6 × 4
## id date time step_total
## <dbl> <chr> <chr> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373
## 2 1503960366 2016-04-12 01:00:00 160
## 3 1503960366 2016-04-12 02:00:00 151
## 4 1503960366 2016-04-12 03:00:00 0
## 5 1503960366 2016-04-12 04:00:00 0
## 6 1503960366 2016-04-12 05:00:00 0
head(weight_log)
## # A tibble: 6 × 7
## id date time weight_kg weight_pounds bmi is_manual_report
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
## 1 1503960366 2016-05-02 23:59:00 52.6 116. 22.6 TRUE
## 2 1503960366 2016-05-03 23:59:00 52.6 116. 22.6 TRUE
## 3 1927972279 2016-04-13 01:08:00 134. 294. 47.5 FALSE
## 4 2873212765 2016-04-21 23:59:00 56.7 125. 21.5 TRUE
## 5 2873212765 2016-05-12 23:59:00 57.3 126. 21.7 TRUE
## 6 4319703577 2016-04-17 23:59:00 72.4 160. 27.5 TRUE
Merging daily_activity and daily_sleep into a new data set called daily_log using full_join() function with ‘id’ and ‘date’ as the common column. Also, daily_activity has 530 more rows than daily_sleep so zeros will need to fill in the blanks instead of NA so that our daily_activity data doesn’t get cut off where daily_sleep ends (at row 410).
daily_log <- full_join(daily_activity, daily_sleep, by = c('id', 'date'))
daily_log[is.na(daily_log)] <- 0
str(daily_log)
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date[1:940], format: "2016-04-12" "2016-04-13" ...
## $ total_steps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ total_distance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ very_active_distance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ moderately_active_distance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ light_active_distance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ very_active_minutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ fairly_active_minutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ lightly_active_minutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentary_minutes : num [1:940] 728 776 1218 726 773 ...
## $ calories : num [1:940] 1985 1797 1776 1745 1863 ...
## $ total_sleep_records : num [1:940] 1 2 0 1 2 1 0 1 1 1 ...
## $ total_minutes_asleep : num [1:940] 327 384 0 412 340 700 0 304 360 325 ...
## $ total_time_in_bed : num [1:940] 346 407 0 442 367 712 0 320 377 364 ...
Merging hourly_calories, hourly_intensities, and hourly_steps into a new data set called hourly_log.
hourly_log <- print(list(hourly_steps, hourly_calories, hourly_intensities) %>%
reduce(full_join, by = c('id', 'date', 'time')))
## # A tibble: 22,099 × 7
## id date time step_total calories total_intensity average_…¹
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373 81 20 0.333
## 2 1503960366 2016-04-12 01:00:00 160 61 8 0.133
## 3 1503960366 2016-04-12 02:00:00 151 59 7 0.117
## 4 1503960366 2016-04-12 03:00:00 0 47 0 0
## 5 1503960366 2016-04-12 04:00:00 0 48 0 0
## 6 1503960366 2016-04-12 05:00:00 0 48 0 0
## 7 1503960366 2016-04-12 06:00:00 0 48 0 0
## 8 1503960366 2016-04-12 07:00:00 0 47 0 0
## 9 1503960366 2016-04-12 08:00:00 250 68 13 0.217
## 10 1503960366 2016-04-12 09:00:00 1864 141 30 0.5
## # … with 22,089 more rows, and abbreviated variable name ¹average_intensity
## # ℹ Use `print(n = ...)` to see more rows
str(hourly_log)
## tibble [22,099 × 7] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : chr [1:22099] "2016-04-12" "2016-04-12" "2016-04-12" "2016-04-12" ...
## $ time : chr [1:22099] "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
## $ step_total : num [1:22099] 373 160 151 0 0 ...
## $ calories : num [1:22099] 81 61 59 47 48 48 48 47 68 141 ...
## $ total_intensity : num [1:22099] 20 8 7 0 0 0 0 0 13 30 ...
## $ average_intensity: num [1:22099] 0.333 0.133 0.117 0 0 ...
The data has been cleaned and there are now 3 data sets to work with:
daily_log
hourly_log
weight_log
The goal of this analysis is to gain insights into how consumers are using non-Bellabeat smart devices and give Bellabeat recommendations for improvements and new marketing strategies.
The Bellabeat product I chose to focus on was the Bellabeat App. Keeping that in mind, I focused on what it is participants tracked the most on their fitness tracker. This information would be stored directly in to an app for the participant to view. I’ll start my analysis by creating a few data sets that will show more detailed information.
Adding weekdays to all data sets. This will help delegate individual days of the week to all categories of information.
daily_log$weekday <- weekdays(as.Date(daily_log$date))
hourly_log$weekday <- weekdays(as.Date(hourly_log$date))
weight_log$weekday <- weekdays(as.Date(weight_log$date))
str(daily_log)
## tibble [940 × 16] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date[1:940], format: "2016-04-12" "2016-04-13" ...
## $ total_steps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ total_distance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ very_active_distance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ moderately_active_distance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ light_active_distance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ very_active_minutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ fairly_active_minutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ lightly_active_minutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentary_minutes : num [1:940] 728 776 1218 726 773 ...
## $ calories : num [1:940] 1985 1797 1776 1745 1863 ...
## $ total_sleep_records : num [1:940] 1 2 0 1 2 1 0 1 1 1 ...
## $ total_minutes_asleep : num [1:940] 327 384 0 412 340 700 0 304 360 325 ...
## $ total_time_in_bed : num [1:940] 346 407 0 442 367 712 0 320 377 364 ...
## $ weekday : chr [1:940] "Tuesday" "Wednesday" "Thursday" "Friday" ...
str(hourly_log)
## tibble [22,099 × 8] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : chr [1:22099] "2016-04-12" "2016-04-12" "2016-04-12" "2016-04-12" ...
## $ time : chr [1:22099] "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
## $ step_total : num [1:22099] 373 160 151 0 0 ...
## $ calories : num [1:22099] 81 61 59 47 48 48 48 47 68 141 ...
## $ total_intensity : num [1:22099] 20 8 7 0 0 0 0 0 13 30 ...
## $ average_intensity: num [1:22099] 0.333 0.133 0.117 0 0 ...
## $ weekday : chr [1:22099] "Tuesday" "Tuesday" "Tuesday" "Tuesday" ...
str(weight_log)
## tibble [67 × 8] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ date : chr [1:67] "2016-05-02" "2016-05-03" "2016-04-13" "2016-04-21" ...
## $ time : chr [1:67] "23:59:00" "23:59:00" "01:08:00" "23:59:00" ...
## $ weight_kg : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
## $ weight_pounds : num [1:67] 116 116 294 125 126 ...
## $ bmi : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
## $ is_manual_report: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
## $ weekday : chr [1:67] "Monday" "Tuesday" "Wednesday" "Thursday" ...
Creating a column called ‘healthy_activity_minutes’ in the data set daily_log by adding ‘very_active_minutes’ and ‘fairly_active_minutes’ together. This will show how many participants exert themselves more and fall into a healthy activity level category.
daily_log$healthy_activity_minutes <- c(daily_log$very_active_minutes + daily_log$fairly_active_minutes)
head(daily_log)
## # A tibble: 6 × 17
## id date total_…¹ total…² very_…³ moder…⁴ light…⁵ very_…⁶ fairl…⁷
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 1.88 0.550 6.06 25 13
## 2 1503960366 2016-04-13 10735 6.97 1.57 0.690 4.71 21 19
## 3 1503960366 2016-04-14 10460 6.74 2.44 0.400 3.91 30 11
## 4 1503960366 2016-04-15 9762 6.28 2.14 1.26 2.83 29 34
## 5 1503960366 2016-04-16 12669 8.16 2.71 0.410 5.04 36 10
## 6 1503960366 2016-04-17 9705 6.48 3.19 0.780 2.51 38 20
## # … with 8 more variables: lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>, total_sleep_records <dbl>,
## # total_minutes_asleep <dbl>, total_time_in_bed <dbl>, weekday <chr>,
## # healthy_activity_minutes <dbl>, and abbreviated variable names
## # ¹total_steps, ²total_distance, ³very_active_distance,
## # ⁴moderately_active_distance, ⁵light_active_distance, ⁶very_active_minutes,
## # ⁷fairly_active_minutes
## # ℹ Use `colnames()` to see all variable names
Creating a data set for totals by individual Id number. This will show some statistics for each individual participant for the total duration that data was submitted (31 days).
totals_per_id <- daily_log %>%
group_by(id) %>%
summarise(total_steps_per_id = sum(total_steps), total_healthy_activity_minutes = sum(very_active_minutes + fairly_active_minutes), total_days_tracked = sum(total_steps > 0), total_nights_tracked = sum(total_minutes_asleep > 0))
str(totals_per_id)
## tibble [33 × 5] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:33] 1.50e+09 1.62e+09 1.64e+09 1.84e+09 1.93e+09 ...
## $ total_steps_per_id : num [1:33] 375619 178061 218489 79982 28400 ...
## $ total_healthy_activity_minutes: num [1:33] 1794 449 928 44 65 ...
## $ total_days_tracked : int [1:33] 30 31 30 21 17 31 31 31 18 31 ...
## $ total_nights_tracked : int [1:33] 25 0 4 3 5 0 28 1 15 0 ...
Creating a data set called weekday_log to see average daily steps based on the individual day of the week. This will show us the participants as a group for the average steps taken for each day of the week.
weekday_log <- daily_log %>%
mutate(weekday = weekdays(date))
weekday_log$weekday <-ordered(weekday_log$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))
weekday_log <-weekday_log %>%
group_by(weekday) %>%
summarize (average_daily_steps = mean(total_steps))
head(weekday_log)
## # A tibble: 6 × 2
## weekday average_daily_steps
## <ord> <dbl>
## 1 Monday 7781.
## 2 Tuesday 8125.
## 3 Wednesday 7559.
## 4 Thursday 7406.
## 5 Friday 7448.
## 6 Saturday 8153.
After creating new data sets some trends I noticed were: some days of the week were more popular than others for participants tracking steps, half of the participants engaged in healthy activity minutes, some participants hardly tracked anything, and very few participants tracked their weight.
My goal was to find trends in which categories participants tracked the most to see where their interests are when it comes to having a fitness tracker. Needing this information is imperative to being able to provide recommendations for Bellabeat.
My analysis and visuals shows that participants wore their fitness tracker during the day to track steps. Counting steps is a base feature that leads to other statistics such as calories lost, distance traveled and weight loss.
76% of participants did not track their weight, more than 50% of participants did not track their sleep for more than 10 days, and more than 60% of participants did not engaged in healthy activity minutes exceeding 1,000 minutes a month. Based on these statistics, tracking steps is what’s important to participants.
Bellabeat can use this information and my analysis to promote their products, focus on promoting what features consumers look for in a fitness tracker, and showcase other features that stand out from the rest. Bellabeat can offer more than what your typical fitness tracker can and in a more comfortable way.
Bellabeat is a wellness brand focused on women. With that being said, it is hard to analyze data with so many limitations and not knowing the genders of our participants. This knowledge would help us make better recommendations for Bellabeat. I would recommend that Bellabeat conduct it’s own survey over a longer period of time to produce more accurate data results.
By analyzing the smart device data that was suggested, these are my recommendation for Bellabeat to discover growth opportunities to improve the company and their marketing strategy:
Target Clientele - Knowing that the consumers of Bellabeat products are young adult women, discover advertising techniques targeting that demographic would be benificial.
Notifications -
Sending personalized notifications and messages makes a person feel more relevant and is more likely to be opened. This could be great motivation to perform the task that is asked of them.
Create a notification that alerts the person when they have been in a sedentary state for a long period of time during the day.
Remind them of their goals and suggest movement activity
Set up a weekly notification to record weight and reward them for doing so.
Promote healthy sleep habits with notifications suggesting winding down techniques, meditation, relaxing music or podcasts.
Notifications for drinking more water, bonuses if they have Bellabeats smart water bottle.
Social Media - Having a online community of women share their success stories, achievements, workouts, and meal plans brings more people together. Adding the ability to ask questions anonymously and creating challenges would be important too.
Challenges - Re-engagement after someone is sick or goes on vacation can be hard for that person to get back into routine. Creating different kinds of challenges can keep users interested and engaged. Continuous motivation is key when it comes to consistently tracking your health. Creating games to keep users maintaining healthy habits and reward them for passing challenges can keep people engaged too.
Promote Bellabeat Products through social media reaching out to groups with women, companies that have fitness programs for employees, fitness facilities, and young moms. Bellabeat has many products that are amazing and not only help women track their health but are comfortable to wear.
Having the option to not wear something on your wrist constantly had me immediately interested. From personal experience, I hated wearing my fitness tracker to bed and I forget it sometimes when I put it on the charger. Having a necklace or clipping something to a shirt is a game changer. I’m sure I’m not alone with this feeling and many other women out there could benefit from knowing these items exist.
Bellabeats smart water bottle and products you don’t have to charge constantly like the Leaf or Time should be highly advertised. I look forward to seeing more of these products on the market soon.
*THANK YOU FOR TAKING THE TIME TO READ MY CASE STUDY