Data Analytics Case Study

How Can a Wellness Technology Company Play It Smart?

Scenario

I am a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analysing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyse smart device data to gain insight into how consumers are using their smart devices. The insights that I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy.

Bellabeat Products

Bellabeat App: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat Membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

1. Ask

Urška Sršen asked me to analyse smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wanted me to select one Bellabeat product to apply these insights in my presentation.

1.1 Business Task

The business task is to analyse smart device usage data of non-Bellabeat smart devices to gain insight into relevant (successful and unsuccessful) consumer trends within the global smart device market, and to discover how these trends apply to Bellabeat customers and can influence future Bellabeat marketing strategies. This is done by applying said insights to the Bellabeat App and to future products in order to maximise profits and growth for the company and to capitalise on Bellabeat’s rapidly growing consumer base in the smart device/tech-wellness space. Urška Sršen, Sando Mur (Bellabeat’s other co-founder and key member of the Bellabeat executive team), the Bellabeat executive team, the Bellabeat Marketing Analytics Team, and the Bellabeat investors are the key stakeholders that need to be considered in my data analysis and decisions.

2. Prepare

Sršen encouraged me to use the following public data that explores smart device users’ daily habits: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius).

This Kaggle data set contains personal fitness tracker data from 30 FitBit users. These 30 eligible users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activities, steps, and heart rate that can be used to explore users’ habits.

2.1 Notes about the Data

Eighteen data sets were provided in the FitBit Fitness Tracker Data link above, each stored as a .csv file. This analysis will focus on three of them: the daily activity data set (‘activity_daily’), which contains merged data from other provided files like daily calories, daily intensities, and daily steps; the weight data set (‘weight’); and the daily sleep data set (‘sleep_daily’). These files contain relevant data that are also tracked by Bellabeat products - this will provide me with the most relevant and useful insights to solve the business task at hand.

2.2 Issues with the Data Credibility

I will be using the ROCCC framework (Reliable, Original, Comprehensive, Current, Cited) to determine if there are any issues with bias or credibility in this data.

Reliable: NOT reliable. This data contains only 30 selected individuals, which is not a representative sample of the 30+ million FitBit users. A sample of 30 equates to a margin of error of about 18% at a 95% confidence level, or about 15% at a 90% confidence level, which is not good. A sample size of more than 10 times the current amount would be a good minimum to provide a high confidence level (95%) and a low margin of error (5%). It should be noted, though, that a sample of 30 is the commonly cited rule-of-thumb minimum for the Central Limit Theorem (CLT) approximation to be considered reasonable, so it is good that the provided data at least meets this threshold. Moreover, all of the data was only obtained over the course of two months, which is not a long enough time to deduce accurate and reliable trends - I would prefer to have at least a year’s worth of data to find meaningful trends and insights.
Original: NOT original. The data set was generated by respondents to a distributed survey via Amazon Mechanical Turk. It would have been better if the data was supplied directly by FitBit.
Comprehensive: NOT comprehensive. The data is not comprehensive in the sense that other data (not present) would be useful to create a more accurate analysis (e.g., sex, age, and height). Also, having more data from more individuals would help with the overall comprehensiveness; for example, having a more representative sample of the 30+ million FitBit users. Again, the data was only collected over the course of two months, which is not comprehensive - I would prefer to have a year’s worth of data. In addition, there is no way of knowing whether the individuals were selected at random or whether there was a bias in their selection. What were the criteria for selecting the 30 individuals? More details about the data would help.
Current: NOT current. The data was collected in 2016, six years before this analysis, so it is not an up-to-date data set representative of current trends.
Cited: Cited but NOT credible. The data came from Amazon Mechanical Turk, so it may or may not be a reliable source. More research needs to be done about the integrity and credibility of Amazon Mechanical Turk.

Overall, the conclusions made from this analysis should be taken with a grain of salt, as the data integrity and credibility are lacking. However, the general insights could still prove to be useful by highlighting possible shortcomings in FitBit’s data tracking (by FitBit itself or by the individuals), which Bellabeat can then improve on through innovative features exclusive to the Bellabeat product line.
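
The margin-of-error figures quoted under "Reliable" can be reproduced with the standard formula for a sample proportion. The sketch below assumes the worst case p = 0.5 and ignores any finite-population correction (reasonable when 30 users are drawn from tens of millions); the function names are hypothetical and only for illustration.

```r
# Margin of error for a sample proportion, assuming the worst case p = 0.5.
margin_of_error <- function(n, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  z * sqrt(p * (1 - p) / n)
}

# Smallest sample size that achieves a target margin of error.
required_sample_size <- function(margin, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  ceiling(z^2 * p * (1 - p) / margin^2)
}

moe_95 <- margin_of_error(30, 0.95)
moe_90 <- margin_of_error(30, 0.90)
n_needed <- required_sample_size(0.05, 0.95)
```

Under these assumptions, a sample of 30 gives roughly 18% at the 95% level and 15% at the 90% level, and a 5% margin at the 95% level needs about 385 users, which is indeed more than 10 times the current sample.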

2.3 Loading Packages

Below are the packages that I will/might be using in this case study:

library(tidyverse)
library(lubridate)
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(skimr)
library(janitor)
library(scales)

2.4 Importing Data Sets

activity_daily <- read_csv("dailyActivity_merged.csv")
weight <- read_csv("weightLogInfo_merged.csv")
sleep_daily <- read_csv("sleepDay_merged.csv")

3. Process

Now that I have finished preparing the data, I will begin the processing step. Here, I will verify the data, then clean and transform it for analysis.

3.1 Verifying Data

The next step is to verify that the data sets were imported properly and to check for any obvious errors.

head(activity_daily)

## # A tibble: 6 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

head(weight)

## # A tibble: 6 × 8
##           Id Date       WeightKg WeightPounds   Fat   BMI IsManualReport   LogId
##        <dbl> <chr>         <dbl>        <dbl> <dbl> <dbl> <lgl>            <dbl>
## 1 1503960366 5/2/2016 …     52.6         116.    22  22.6 TRUE           1.46e12
## 2 1503960366 5/3/2016 …     52.6         116.    NA  22.6 TRUE           1.46e12
## 3 1927972279 4/13/2016…    134.          294.    NA  47.5 FALSE          1.46e12
## 4 2873212765 4/21/2016…     56.7         125.    NA  21.5 TRUE           1.46e12
## 5 2873212765 5/12/2016…     57.3         126.    NA  21.7 TRUE           1.46e12
## 6 4319703577 4/17/2016…     72.4         160.    25  27.5 TRUE           1.46e12

head(sleep_daily)

## # A tibble: 6 × 5
##           Id SleepDay           TotalSleepRecor… TotalMinutesAsl… TotalTimeInBed
##        <dbl> <chr>                         <dbl>            <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:00:0…                1              327            346
## 2 1503960366 4/13/2016 12:00:0…                2              384            407
## 3 1503960366 4/15/2016 12:00:0…                1              412            442
## 4 1503960366 4/16/2016 12:00:0…                2              340            367
## 5 1503960366 4/17/2016 12:00:0…                1              700            712
## 6 1503960366 4/19/2016 12:00:0…                1              304            320

colnames(activity_daily)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

colnames(weight)

## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

colnames(sleep_daily)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

view(activity_daily)
view(weight)
view(sleep_daily)

The first thing I noticed is that the data set only contains one month’s worth of data rather than two months of data, which raises further questions about the credibility and comprehensiveness of the data set.

The second thing I noticed is the inconsistency in the logging/tracking of the data. Not everyone logged/tracked their data each day. Some people appear not to have worn their FitBits on certain days, which were recorded as zero steps; this will skew any analysis, so I will remove those days from the data set. Some people did not participate in recording their sleep or their weight. And some people did not participate for the whole month. This will make a complete and more in-depth analysis a lot more difficult to conduct than originally thought.

activity_daily_new <- activity_daily %>% 
  filter(TotalSteps !=0)

view(activity_daily_new)

Removing the zero-step days will definitely help with the analysis! However, there are still some very low step counts present, as well as very low values for calories burnt. I will keep these in the data set for analysis, because perhaps those individuals were bedridden all day or they fasted all day. There is a lot of uncertainty here, and nothing can be said for certain without more detailed information. But, again, this calls into question the overall integrity and credibility of the data set.

The third thing I noticed is that the sleep data set and weight data set both contain the date and time in one column. It is best to separate into “Date” and “Time” columns, if I do decide to use the date as a way to analyse the data between the three files. However, whilst viewing the data sets, I noticed a large discrepancy in the number of unique IDs present, as well as inconsistencies in the daily logging/tracking of the individual’s weight and sleep.

weight_new <- weight %>% 
  separate(Date, c("Date", "Time"), " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

sleep_daily_new <- sleep_daily %>% 
  separate(SleepDay, c("Date", "Time"), " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
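
The "Additional pieces discarded" warnings most likely arise because each timestamp has three space-separated pieces (date, time, and an AM/PM marker), so splitting into two columns drops the marker. A minimal base R sketch of splitting on the first space only is shown here; the timestamps are made up for illustration and merely follow the same layout. With tidyr, passing `extra = "merge"` to `separate()` should have a similar effect.

```r
# Illustrative timestamps (made up) in the "m/d/yyyy h:mm:ss AM/PM" layout.
stamps <- c("4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM")

# Split on the first space only, so the AM/PM marker stays with the time.
date_part <- sub(" .*$", "", stamps)
time_part <- sub("^[^ ]+ ", "", stamps)

split_stamps <- data.frame(Date = date_part, Time = time_part)
```

Keeping the marker matters only if the time of day is analysed later; for a date-only join, the discarded piece is harmless.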

n_distinct(activity_daily_new$Id)

## [1] 33

n_distinct(weight_new$Id)

## [1] 8

n_distinct(sleep_daily_new$Id)

## [1] 24

Interestingly, not everyone involved in this survey provided tracking data for each data set. Only eight people entered their weight (where only two people logged/tracked most of the data), and only 24 people entered their sleep data. Also, there are 33 people recorded in the daily activity data, despite the data citation saying there are 30 people in the sample. This further calls into question the credibility of the data, as it seems to become even less comprehensive/reliable and dirtier the more I look into it. I might not be able to cross-analyse the data as much as I would like due to the amount of incomplete and inconsistently tracked data. Due to the incompleteness of those data sets, I may just focus on the ‘activity_daily’ data set for a more focused analysis and then include some general recommendations for improving the consistency of logging/tracking weight and sleep.
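
To see exactly which users overlap between the data sets, the ID columns can be compared with set operations. This is only a sketch: the ID vectors below are hypothetical stand-ins for something like `unique(activity_daily_new$Id)`.

```r
# Hypothetical user IDs standing in for unique(<data set>$Id).
activity_ids <- c(101, 102, 103, 104, 105)
sleep_ids    <- c(101, 102, 104)
weight_ids   <- c(102, 105)

# Users present in all three data sets.
in_all_three <- Reduce(intersect, list(activity_ids, sleep_ids, weight_ids))

# Users with activity data but no sleep data.
no_sleep <- setdiff(activity_ids, sleep_ids)
```

The size of `in_all_three` is the largest group for which a full cross-analysis of activity, sleep, and weight would be possible.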

The fourth thing I noticed is that there could be some duplicated rows in some of the data sets. I will confirm this and delete the duplicated rows for cleaner data.

nrow(activity_daily_new)

## [1] 863

nrow(weight_new)

## [1] 67

nrow(sleep_daily_new)

## [1] 413

nrow(unique(activity_daily_new))

## [1] 863

nrow(unique(weight_new))

## [1] 67

nrow(unique(sleep_daily_new))

## [1] 410

Good thing I checked. I will clean up the daily sleep data set by creating a new data set with just the unique rows.

sleep_daily_new_v2 <- unique(sleep_daily_new)
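
Before dropping duplicates, it can be useful to look at the rows involved. The sketch below uses a small made-up data frame (not the real sleep data) and base R's `duplicated()` to flag them.

```r
# Made-up sleep records; rows 2 and 3 are exact duplicates.
sleep_demo <- data.frame(
  Id = c(1, 2, 2, 3),
  Date = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 384, 384, 412)
)

# Flag every row involved in a duplicate (first copy and later copies).
is_dup <- duplicated(sleep_demo) | duplicated(sleep_demo, fromLast = TRUE)
dup_rows <- sleep_demo[is_dup, ]

# Keep one copy of each row, as unique() does.
sleep_dedup <- sleep_demo[!duplicated(sleep_demo), ]
```

Inspecting `dup_rows` first helps confirm that the repeated rows are true duplicates rather than, say, two legitimate sleep sessions on the same day.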

Now I am ready to analyse the data.

3. Analyse

At this step, I will identify trends and relationships that I find. Any surprises about the data will also be mentioned. Hopefully I can discover valuable insights from my analysis that can help answer my business question.

First, I am going to take a look at a detailed summary of each data set.

skim_without_charts(activity_daily_new)

Data summary
Name	activity_daily_new
Number of rows	863
Number of columns	15
_______________________
Column type frequency:
character	1
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ActivityDate	0	1	8	9	0	31	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	1	4.857542e+09	2.418405e+09	1503960366	2.320127e+09	4.445115e+09	6.962181e+09	8.877689e+09
TotalSteps	1	8319.39	4744.97	4	4923	8053	11092.5	36019
TotalDistance	1	5.98	3.72	0	3.37	5.59	7.90	28.03
TrackerDistance	1	5.96	3.70	0	3.37	5.59	7.88	28.03
LoggedActivitiesDistance	1	0.12	0.65	0	0	0	0	4.94
VeryActiveDistance	1	1.64	2.74	0	0	0.41	2.27	21.92
ModeratelyActiveDistance	1	0.62	0.91	0	0	0.31	0.87	6.48
LightActiveDistance	1	3.64	1.86	0	2.34	3.58	4.89	10.71
SedentaryActiveDistance	1	0.00	0.01	0	0	0	0	0.11
VeryActiveMinutes	1	23.02	33.65	0	0	7	35	210
FairlyActiveMinutes	1	14.78	20.43	0	0	8	21	143
LightlyActiveMinutes	1	210.02	96.78	0	146.5	208	272	518
SedentaryMinutes	1	955.75	280.29	0	721.5	1021	1189	1440
Calories	1	2361.30	702.71	52	1855.5	2220	2832	4900

skim_without_charts(weight_new)

Data summary
Name	weight_new
Number of rows	67
Number of columns	9
_______________________
Column type frequency:
character	2
logical	1
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Date	0	1	8	9	0	31	0
Time	0	1	7	8	0	26	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
IsManualReport	0	1	0.61	TRU: 41, FAL: 26

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	0	1.00	7.009282e+09	1.950322e+09	1.503960e+09	6.962181e+09	6.962181e+09	8.877689e+09	8.877689e+09
WeightKg	0	1.00	72.04	13.92	52.60	61.40	62.50	85.05	133.50
WeightPounds	0	1.00	158.81	30.70	115.96	135.36	137.79	187.50	294.32
Fat	65	0.03	23.50	2.12	22.00	22.75	23.50	24.25	25.00
BMI	0	1.00	25.19	3.07	21.45	23.96	24.39	25.56	47.54
LogId	0	1.00	1.461772e+12	7.829948e+08	1.460444e+12	1.461079e+12	1.461802e+12	1.462375e+12	1.463098e+12

skim_without_charts(sleep_daily_new_v2)

Data summary
Name	sleep_daily_new_v2
Number of rows	410
Number of columns	6
_______________________
Column type frequency:
character	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Date	0	1	8	9	0	31	0
Time	0	1	8	8	0	1	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	1	4.994963e+09	2.060863e+09	1503960366	3.977334e+09	4702921684.0	6962181067	8792009665
TotalSleepRecords	1	1.12	0.35	1	1	1	1	3
TotalMinutesAsleep	1	419.17	118.64	58	361	432.5	490	796
TotalTimeInBed	1	458.48	127.46	61	403.75	463	526	961

This is a nice overview for checking that all of the necessary cleaning was done and whether any immediate issues stand out. It looks good so far, but I would like to condense each file into only the columns that I want to use for my more focused analysis.

activity_daily_final <- activity_daily_new %>% 
  select(Id, ActivityDate, TotalSteps, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories) %>% 
  rename(Date = ActivityDate)

weight_final <- weight_new %>% 
  select(Id, Date, BMI, WeightPounds, IsManualReport)

sleep_daily_final <- sleep_daily_new_v2 %>% 
  select(Id, Date, TotalMinutesAsleep, TotalTimeInBed)

A side note: I would like to convert the ‘Date’ column for ‘activity_daily_final’ into ‘WeekDays’ to find a relationship between which days of the week people are more consistent with logging/tracking their data, but unfortunately, I am constantly receiving ‘NA’ and ‘parsing errors’ when trying to use ‘lubridate’ functions and other functions. I will have to examine this issue further another time. But from my own experience, people tend to become more lackadaisical with logging/tracking data during the weekend and closer to the weekend - be it the Friday before or the Monday after.
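
One possible cause of the ‘NA’ and parsing errors is that the dates are stored as text in month/day/year order, which differs from the year-month-day layout that date parsers expect by default. The sketch below is a base R approach under that assumption, using three date strings from the `head(activity_daily)` output above; with ‘lubridate’, `mdy()` is the parser for this ordering.

```r
# Dates in the same m/d/yyyy text layout as the ActivityDate column.
activity_dates <- c("4/12/2016", "4/16/2016", "4/17/2016")

# Tell as.Date() the layout explicitly; without a format, these strings
# are not in the default year-month-day layout and will not parse as intended.
parsed <- as.Date(activity_dates, format = "%m/%d/%Y")

# ISO weekday number: 1 = Monday ... 7 = Sunday (locale independent).
weekday_num <- as.integer(format(parsed, "%u"))
is_weekend <- weekday_num >= 6
```

`weekdays(parsed)` would give the day names instead, although those depend on the session's locale. Grouping the number of logged rows by `weekday_num` would then show whether logging really drops off around the weekend.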

Now, I want to take a look at a more specific summary of the values by using the ‘summary’ function.

summary(activity_daily_final)
summary(weight_final)
summary(sleep_daily_final)

3.1 Trends

The median Total Steps per day for an individual is 8053 (the mean is 8319).
The average minutes for Very Active is 23.02, for Fairly Active is 14.78, for Lightly Active is 210, for Sedentary is 955.8.
The average BMI is 25.19 (based on 67 weight entries from only eight users).
The average minutes asleep is 419.2, whilst the average minutes in bed is 458.5.
Note, I should mention again that there are still outliers in the data that were not removed due to the lack of information regarding the data. These were kept in case those extreme values were in fact legitimate. However, if those values were not legitimate, the average values above may be slightly skewed, so keep that in mind.
A couple of trends that I noticed earlier were that people were not consistent in tracking/logging their data - some did not log their sleep or their weight every day or any day at all - and certain individuals who were consistently tracking/logging their data were not losing weight or seeing results over the month of data collection.
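
Because a couple of users account for most of the weight entries, a plain mean over all rows gives those users more influence than the rest. One way to check for this, sketched here on made-up numbers rather than the real data, is to average within each user first and then average across users.

```r
# Made-up BMI log: user "A" logs far more often than users "B" and "C".
bmi_log <- data.frame(
  Id  = c("A", "A", "A", "A", "B", "C"),
  BMI = c(24, 24, 24, 24, 30, 27)
)

# Mean over all log entries (dominated by user A).
mean_per_entry <- mean(bmi_log$BMI)

# Mean of each user's own average (every user counts once).
per_user <- aggregate(BMI ~ Id, data = bmi_log, FUN = mean)
mean_per_user <- mean(per_user$BMI)
```

In this toy example the two approaches give 25.5 and 27, which shows how much frequent loggers can pull a per-entry average.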

3.2 Conclusions from Trends

According to a joint research investigation by the National Cancer Institute (NCI), the National Institute on Aging (NIA), and the Centers for Disease Control and Prevention (CDC) (amongst other research studies), the ideal daily number of Total Steps one should achieve is 10,000. Thus, the average individual here, at roughly 8000 to 8300 steps a day, is not reaching that goal.
One reason for this is their activity level. The individuals spent on average 955.8 minutes a day being sedentary, which is about 16 hours a day.
Since the average BMI is 25.19, these individuals fall just inside the overweight category (a BMI of 25 or above), according to the World Health Organisation (WHO), although this average comes from only eight users.
It would make sense that overweight people are wearing FitBits to help them get in better shape, but they are not being active enough to do so. The more active someone is, the more steps they will achieve, the more calories they will burn, and the lower their BMI should become over time.
Furthermore, the average person is getting just under the minimum recommended amount of sleep (7 hours), according to the National Sleep Foundation (NSF). Luckily, the individuals spend only about 39 minutes a night awake in bed (458.5 minutes in bed minus 419.2 minutes asleep), which includes the time taken to fall asleep (and is a lot faster than what I can do).

3.3 Questions to Answer

With all of these conclusions made, what can Bellabeat do to fix these issues?
How would Bellabeat be able to promote consistency with user logging/tracking of data?
How would Bellabeat help the average user to achieve the minimum Total Steps a day?
How would Bellabeat help the users reach a healthy BMI whilst losing weight at a healthy and consistent rate?
How would Bellabeat help the users improve their sleep?

4. Share

Now, I will present my insights and important findings through visualisations.

# Total minutes logged at each activity level, summed over all users and days
slices <- c(
  sum(activity_daily_final$VeryActiveMinutes),
  sum(activity_daily_final$FairlyActiveMinutes),
  sum(activity_daily_final$LightlyActiveMinutes),
  sum(activity_daily_final$SedentaryMinutes)
)
lbls <- c("VeryActive", "FairlyActive", "LightlyActive", "Sedentary")

# Share of total logged minutes at each level, rounded to whole percentages
pct <- round(slices / sum(slices) * 100)
lbls <- paste0(lbls, " ", pct, "%")
pie(slices, labels = lbls, col = rainbow(length(lbls)), main = "Percentage of Activity in Minutes")

With the Daily Activity Levels in minutes shown as percentages, it is visually clear that the individuals are not active enough: 79% of all logged minutes over the month were sedentary. For users of a fitness tracker, this is not good to see, especially when very active and fairly active minutes make up only 2% and 1% of the total time, respectively.

ggplot(data=activity_daily_final) +
  geom_point(mapping=aes(x=TotalSteps, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=TotalSteps, y=Calories)) +
  labs(title="The Relationship Between Total Steps and Calories Burned", x="Total Steps", y="Calories Burned (kcal)")

The relationship is as expected: the more steps an individual takes, the more calories they burn. And of course, the more active a person is, the more steps they will take, which in turn means more calories are burned. The average person in this data set only reaches about 8000 Total Steps for the day, which corresponds to just under 2500 calories burned for the day.

Without more information regarding the person’s age, sex, and height, it would be impossible to say exactly how many calories the person needs to burn to lose weight at a healthy rate. However, they are clearly not burning enough calories, as the BMI and weight of the individuals who logged those values did not see an improvement over the month of data collection.

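# Join on both Id and Date so that each activity day is paired only with the same user's sleep record for that day
combined_data <- merge(activity_daily_final, sleep_daily_final, by=c("Id", "Date"))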
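
The choice of join key matters when combining two daily tables. The sketch below uses small made-up data frames (the names `activity_demo` and `sleep_days` are hypothetical) to show how a merge on ‘Id’ alone and a merge on ‘Id’ plus ‘Date’ produce different results.

```r
# Made-up daily records for one user over two days.
activity_demo <- data.frame(
  Id = c(1, 1),
  Date = c("4/12/2016", "4/13/2016"),
  SedentaryMinutes = c(700, 800)
)
sleep_days <- data.frame(
  Id = c(1, 1),
  Date = c("4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 384)
)

# Joining on Id alone pairs every activity day with every sleep day.
by_id <- merge(activity_demo, sleep_days, by = "Id")

# Joining on Id and Date keeps only same-day pairs.
by_id_date <- merge(activity_demo, sleep_days, by = c("Id", "Date"))
```

Here the ‘Id’-only merge returns four rows and the ‘Id’ plus ‘Date’ merge returns two. Only the second keeps each night's sleep next to that same day's activity, which is what a scatter plot of sleep against activity needs.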

# Plot each activity level against minutes asleep.
# The smoothing line colour is set outside aes(), since it is a fixed colour rather than a mapped variable.
ggplot(data=combined_data) +
    geom_point(mapping=aes(x=TotalMinutesAsleep, y=VeryActiveMinutes, color="VeryActiveMinutes")) +
    geom_smooth(mapping=aes(x=TotalMinutesAsleep, y=VeryActiveMinutes), color="blue") +
    labs(title="The Relationship Between Activity Levels and Total Minutes Asleep", x="Total Minutes Asleep", y="Minutes of Activity")

ggplot(data=combined_data) +
    geom_point(mapping=aes(x=TotalMinutesAsleep, y=FairlyActiveMinutes, color="FairlyActiveMinutes")) +
    geom_smooth(mapping=aes(x=TotalMinutesAsleep, y=FairlyActiveMinutes), color="blue") +
    labs(title="The Relationship Between Activity Levels and Total Minutes Asleep", x="Total Minutes Asleep", y="Minutes of Activity")

ggplot(data=combined_data) +
    geom_point(mapping=aes(x=TotalMinutesAsleep, y=LightlyActiveMinutes, color="LightlyActiveMinutes")) +
    geom_smooth(mapping=aes(x=TotalMinutesAsleep, y=LightlyActiveMinutes), color="blue") +
    labs(title="The Relationship Between Activity Levels and Total Minutes Asleep", x="Total Minutes Asleep", y="Minutes of Activity")

ggplot(data=combined_data) +
    geom_point(mapping=aes(x=TotalMinutesAsleep, y=SedentaryMinutes, color="SedentaryMinutes")) +
    geom_smooth(mapping=aes(x=TotalMinutesAsleep, y=SedentaryMinutes), color="blue") +
    labs(title="The Relationship Between Activity Levels and Total Minutes Asleep", x="Total Minutes Asleep", y="Minutes of Activity")

I thought it would be interesting to map the Total Minutes Asleep against each activity level to see if there is a relationship. I hypothesised that the more sleep someone had, the more active they would be, and the less sleep someone had, the less active (more sedentary) they would be. However, this was not the case. Regardless of the amount of sleep, the average person was still sedentary. Surprisingly and strangely though, in these plots the people who slept the most minutes were also the most sedentary!

5. Act

5.1 Revisiting Business Task

As a reminder, the business task is to analyse usage data from non-Bellabeat smart devices to identify consumer trends, and to apply those trends to the Bellabeat App and to Bellabeat’s marketing strategy.

5.2 Trends Identified

The participating individuals took a median of 8053 Total Steps per day (mean 8319), which is almost 2000 steps below the suggested 10,000 Total Steps per day.
On average, 79% of total minutes per day were spent being sedentary by the participating individuals over the course of a month.
The participating individuals who logged their weight had an average BMI of 25.19, which puts them into the overweight category.
On average, the participating individuals slept slightly less than the suggested minimum of 7 hours of sleep.
The participating individuals were not consistent with logging/tracking their data each day over the course of the month, and some individuals did not even log/track their sleep or weight (only 24 unique users for sleep and eight for weight - where only two of these eight made up the majority of the inputs).
The participating individuals did not lose weight, did not improve their BMI or sleep quality, and did not see any overall improvement in their activity levels.

5.3 Recommendations

Bellabeat should offer incentives for consistent tracking, like in-app competitions against friends or other users in the same city/state.
- Bellabeat could offer prizes or points, which can be redeemed for merchandise, exclusive access and discounts to future products, exclusive in-app features, or tickets for raffles.
- Bellabeat could offer greater incentives (i.e., more points) from Friday-Monday, since a lot of people lose motivation or consistency during this time of the week.
Bellabeat products should have a built-in TDEE (total daily energy expenditure) calculator, where the users are able to input their sex, age, weight, height, and other information to create accurate results.
- This calculator will notify the user of what their maintenance calories are (and their macros) and how much of a caloric deficit the user needs to be in each day to lose an X amount of lbs each week, based upon their weight goals and time frame.
- For example, someone needing 2000 maintenance calories will have to be in a 500 caloric deficit each day to lose about 1 lb of fat each week at a steady rate.
- The user would also be notified if they are reaching, have reached, or have passed their daily caloric intake.
- The user would also be notified when they’ve reached their suggested (or inputted) weight goal or are on the right track.
- The Bellabeat app can provide nutritional advice as part of its membership, which goes into detail about healthy recipes and managing macros.
- The app could have a list of activities and videos of said activities that people can do to burn some quick calories (since the average person is very sedentary and might not have a lot of time to spend hours in the gym, this would be a good incentive to exercise and burn a lot of calories in a short period of time).
- These could be quick instructional videos that showcase 30-minute, equipment-free exercises that burn a few hundred calories (e.g., crunches, jump rope, burpees, shadowboxing, and HIIT).
Bellabeat products should be able to track sleep automatically, since users struggled to input their time consistently.
- Bellabeat could implement a Leaf or an app notification to notify the user of the ideal time to sleep as per the user’s schedule.
- The user would be notified an hour or two before bed to start winding down from using electronics that use blue light (it could even automatically switch the phone to night mode to prevent blue light exposure).
- The user would be notified about 30 minutes prior to bed time to take any melatonin or sleep medication, if the user uses these.
- The user would be notified a few minutes before bed that it is time to get in bed and sleep - this should help reduce the roughly 39 minutes a night that the average individual currently spends awake in bed, and would improve the overall length and quality of sleep.
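
The 500-calorie example in the TDEE recommendation follows from a common rule of thumb that a pound of body fat corresponds to roughly 3500 kcal. That figure is an approximation and is treated as an assumption in the sketch below, which shows only the deficit arithmetic, not a full TDEE calculator (one would also need sex, age, weight, height, and activity level). The function name is hypothetical.

```r
# Rule-of-thumb assumption: about 3500 kcal of deficit per pound of fat.
KCAL_PER_LB <- 3500

# Daily calorie target for a given maintenance level and weekly loss goal.
daily_target <- function(maintenance_kcal, lbs_per_week) {
  daily_deficit <- lbs_per_week * KCAL_PER_LB / 7
  maintenance_kcal - daily_deficit
}

target <- daily_target(maintenance_kcal = 2000, lbs_per_week = 1)
```

For a user with 2000 maintenance calories aiming to lose 1 lb a week, this works out to a 500-calorie daily deficit and a target of 1500 calories a day.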

5.4 Future Works

If this data set were to be collected again, I would like to see the following parameters met in order to support a more robust and in-depth analysis of this type of data:

Larger sample size that is more representative of the total user base (with a high confidence level ~95% and a low margin of error ~5%).
- Completely random with no bias during the selection of individuals.
- Ensure a longer data collection period, at least ~1 year.
- If there are any gaps in the data or days missed, an explanation would be necessary.
- More information on the individual, like their age, sex, height, etc.
- Individuals with extremely inconsistent logging/tracking habits should be excluded.
More current and up-to-date data, something from the past year.
Have an original (internal) data source, or at least have the primary/secondary data source be verified for integrity and credibility.

Google Data Analytics Capstone

Nicholas Peters

2022-03-14
