Bellabeat is a high tech company that manufactures health-focused smart products for women. Their products are designed to help users live healthier lives. Bellabeat is a successful small company but it is seeking to tap into the opportunities available in the global market in order to grow its business. This study was carried to analyze trends in health-focused smart product market in order to unlock new opportunites for growth.
The business task is to analyze smart device usage data in order to gain insight into how users use these devices. Knowledge gained from this analysis is to be used to explore new opportunities to further grow the company.
The data to be used for this study was sourced from the CCO public domain. As such, it is data that is freely available to every researcher to use. The data contained in the domain consist of data on activity, sleep, calories, intensities, steps taken and weight logs. My approach to this study is to do a deep dive into the different aspects of the data with the intention of gaining more insights into how smart devices can be used to help guide the health choices of users.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(here)
## here() starts at /cloud/project
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr)
daily_activities <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(daily_activities)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
Checking the number of participants available in the study
n_distinct(daily_activities$Id)
## [1] 33
nrow(daily_activities)
## [1] 940
The number of participants in the daily activities is 33. Statistically, this number is high enough for us to draw valid inferences from the data. Below is a quick summary of some observations in the study
daily_activities %>%
mutate(sedentary_hours = round(SedentaryMinutes/60, digits = 2)) %>%
select(TotalSteps, TotalDistance, sedentary_hours) %>%
summary()
## TotalSteps TotalDistance sedentary_hours
## Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.:12.16
## Median : 7406 Median : 5.245 Median :17.62
## Mean : 7638 Mean : 5.490 Mean :16.52
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:20.49
## Max. :36019 Max. :28.030 Max. :24.00
The average steps taken by the participants daily is 7638 while the average number of hours users remain sedentary in a day is 16.52. The average sedentary hours is very high and users should be sensitized to reduce this as it could negatively impact their health.
Let us take closer look at users whose daily activities might expose them to higher health risks
daily_activeness <- daily_activities %>%
mutate(sedentary_hrs = round(SedentaryMinutes/60, digits = 2)) %>%
select(Id, TotalSteps, sedentary_hrs) %>%
group_by(Id) %>%
summarize(average_daily_steps = mean(TotalSteps), average_sedentary_hrs = mean(sedentary_hrs)) %>%
mutate(activeness = case_when(average_daily_steps < 4000 ~ "Low Daily Steps", average_daily_steps >= 4000 & average_daily_steps <= 6000 ~ " Fairly active", average_daily_steps > 6000 ~ "Highly active")) %>%
mutate(sedentariness = case_when(average_sedentary_hrs > 12 ~ "Highly sedentary", average_sedentary_hrs > 8 & average_sedentary_hrs <= 12 ~ "Averagely sedentary", average_sedentary_hrs < 8 ~ " Low sedentary"))
Let us preview the new dataframe just created
head(daily_activeness)
## # A tibble: 6 × 5
## Id average_daily_steps average_sedentary_hrs activeness seden…¹
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1503960366 12117. 14.1 "Highly active" Highly…
## 2 1624580081 5744. 21.0 " Fairly active" Highly…
## 3 1644430081 7283. 19.4 "Highly active" Highly…
## 4 1844505072 2580. 20.1 "Low Daily Steps" Highly…
## 5 1927972279 916. 22.0 "Low Daily Steps" Highly…
## 6 2022484408 11371. 18.5 "Highly active" Highly…
## # … with abbreviated variable name ¹sedentariness
Visualizing the different categories of activeness
ggplot(data = daily_activeness, aes(x = activeness, fill = activeness))+ geom_bar()+ labs(title = "Daily Activity", x = "Level of Daily Activity", y = "Number of Participants")
Visualizing the different categories of sedentariness
ggplot(data = daily_activeness, aes(x = sedentariness, fill = sedentariness))+ geom_bar()+ labs(title = "Daily Sedentary Lifestyle", x = "Level of Daily Sedentariness", y = "Number of Participants")
The two visualizations above clearly show that the daily habits of most participants exposes them to health risks especially as regards their sedentary lifestyle. Smart device users can be sensitized on the health risks posed by these daily habits.
Now we take a look at the data on calories and see what we can learn from there.
calories_data <- read_csv("dailyCalories_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(calories_data)
## # A tibble: 6 × 3
## Id ActivityDay Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
We will check the number of participants in the data to get some idea of the data’s suitability for analysis
n_distinct(calories_data$Id)
## [1] 33
nrow(calories_data)
## [1] 940
The number of participants is 33 which is high enough to make reasonable inferences from the data. We will now calculate some summary statistics of the data.
calories_data %>%
select(Calories) %>%
summary()
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
Four of the data points on the calories column are zeroes and I believe this was because of non-response for that those data points. Because of the impact this extreme values could have on the results, I deem it necessary to remove the data points.
calories_data1 <- calories_data[-c(31, 654, 818, 880), ]
n_distinct(calories_data1$Id)
## [1] 33
nrow(calories_data1)
## [1] 936
We will now recalculate the summary statistics to get a better idea of information contained in the dataset.
calories_data1 %>%
select(Calories) %>%
summary()
## Calories
## Min. : 52
## 1st Qu.:1834
## Median :2144
## Mean :2313
## 3rd Qu.:2794
## Max. :4900
From the summary statistics above, the average daily calories burnt by participants is 2313. This falls within the recommended range for adults daily. We will now look at the average for each participant and get some idea what these figures look like.
average_daily_participant <- calories_data1 %>%
select(Id, Calories) %>%
group_by(Id) %>%
summarise(average_daily_calories = mean(Calories))
head(average_daily_participant)
## # A tibble: 6 × 2
## Id average_daily_calories
## <dbl> <dbl>
## 1 1503960366 1877.
## 2 1624580081 1483.
## 3 1644430081 2811.
## 4 1844505072 1573.
## 5 1927972279 2173.
## 6 2022484408 2510.
If we are to work with the recommended daily requirement of 2000 calories daily, a number of participants do not meet this requirement.
low_calories_category <- average_daily_participant %>%
filter(average_daily_calories < 2000)
head(low_calories_category)
## # A tibble: 6 × 2
## Id average_daily_calories
## <dbl> <dbl>
## 1 1503960366 1877.
## 2 1624580081 1483.
## 3 1844505072 1573.
## 4 2026352035 1541.
## 5 2320127002 1724.
## 6 2873212765 1917.
13 out of the total of 33 participants did not burn up to the required calories daily. This represents about 39 percent of the data.
We will now take a look at the sleep patterns of the participants in the study.
sleep_day <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sleep_day)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalT…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## # … with abbreviated variable name ¹TotalTimeInBed
Let’s check the number of participants contained in this dataset
n_distinct(sleep_day$Id)
## [1] 24
nrow(sleep_day)
## [1] 413
The total number of participants is 24. Statistically, for a dataset to be suitable for analysis, the number of participants should be 30 and above. But we will still try to examine some of the patterns in this dataset.
sleep_day %>%
select(TotalMinutesAsleep, TotalTimeInBed) %>%
mutate(TotalHoursAsleep = TotalMinutesAsleep/60, TotalHoursInBed = TotalTimeInBed/60) %>%
summary()
## TotalMinutesAsleep TotalTimeInBed TotalHoursAsleep TotalHoursInBed
## Min. : 58.0 Min. : 61.0 Min. : 0.9667 Min. : 1.017
## 1st Qu.:361.0 1st Qu.:403.0 1st Qu.: 6.0167 1st Qu.: 6.717
## Median :433.0 Median :463.0 Median : 7.2167 Median : 7.717
## Mean :419.5 Mean :458.6 Mean : 6.9911 Mean : 7.644
## 3rd Qu.:490.0 3rd Qu.:526.0 3rd Qu.: 8.1667 3rd Qu.: 8.767
## Max. :796.0 Max. :961.0 Max. :13.2667 Max. :16.017
The average hours of sleep most of the participants get is 7.2. Twenty-five percent of the population only get six hours of sleep daily. This is below the recommended average of 8 hours daily given by health professionals. From these figures, we can infer that most of the participants are sleep deprived. We will now attempt to narrow this down to the participants in the study.
sleep_pattern <- sleep_day %>%
mutate(TotalHoursAsleep = TotalMinutesAsleep/60) %>%
select(Id, TotalMinutesAsleep, TotalHoursAsleep) %>%
group_by(Id) %>%
summarize(AverageHoursOfSleep = mean(TotalHoursAsleep)) %>%
mutate(HoursOfSleep = case_when(AverageHoursOfSleep < 8 ~ "Sleep Deprived", AverageHoursOfSleep >= 8 ~ "Sleeping Adequately"))
head(sleep_pattern)
## # A tibble: 6 × 3
## Id AverageHoursOfSleep HoursOfSleep
## <dbl> <dbl> <chr>
## 1 1503960366 6.00 Sleep Deprived
## 2 1644430081 4.9 Sleep Deprived
## 3 1844505072 10.9 Sleeping Adequately
## 4 1927972279 6.95 Sleep Deprived
## 5 2026352035 8.44 Sleeping Adequately
## 6 2320127002 1.02 Sleep Deprived
We will now visualize the sleep patterns among the participants
ggplot(data = sleep_pattern, aes(x = HoursOfSleep))+ geom_bar()+ labs(title = "Daily Hours of Sleep", x = "Hours Of Sleep", y = "Number of Participants")
From the visualizaion above, we can see that a large number of participants are not sleeping adequately. in fact, only two out of the 24 participants are getting adequate sleep. This represents only about 8 percent of the total participants.
We will now explore if the number of hours of sleep a participant gets has anything to do with the time spent in bed.
sleep_day %>%
select(TotalMinutesAsleep, TotalTimeInBed) %>%
summarize(correlation = cor(TotalMinutesAsleep, TotalTimeInBed))
## # A tibble: 1 × 1
## correlation
## <dbl>
## 1 0.930
ggplot(data = sleep_day, aes(x = TotalMinutesAsleep, y = TotalTimeInBed))+ geom_point()+ labs(title = "Sleep: Total Minutes Asleep vs Total Time In Bed", x = "Total Minutes Asleep", y = "Total Time In Bed" )
The positive relationship shown between time spent in bed and hours of sleep and the high correlation of 0.93 between the two strongly suggests that people who spend more time in bed tend to get more sleep.
We will now take a look at the heart rate data and see what we can learn from there.
heartrate_data <- read_csv("heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(heartrate_data)
## # A tibble: 6 × 3
## Id Time Value
## <dbl> <chr> <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM 97
## 2 2022484408 4/12/2016 7:21:05 AM 102
## 3 2022484408 4/12/2016 7:21:10 AM 105
## 4 2022484408 4/12/2016 7:21:20 AM 103
## 5 2022484408 4/12/2016 7:21:25 AM 101
## 6 2022484408 4/12/2016 7:22:05 AM 95
Let us check out the number of participants in this data as well as the size of the entire data
n_distinct(heartrate_data$Id)
## [1] 14
nrow(heartrate_data)
## [1] 2483658
The number of participants in the data is 14 which is very low. This will make it difficult for us to draw valid inferences from the result of our analysis of this data. It will however be interesting to know what the data has to tell us about this 14 participants.
heartrate_cat <- heartrate_data %>%
select(Id, Value) %>%
group_by(Id) %>%
summarize(average_heartrate_per_minute = mean(Value)) %>%
mutate(heartrate_category = case_when(average_heartrate_per_minute < 60 ~ "At Risk of Bradycardia", average_heartrate_per_minute >= 60 & average_heartrate_per_minute <= 100 ~ " Normal Heartrate", average_heartrate_per_minute > 100 ~ "At Risk of Tachycardia"))
head(heartrate_cat)
## # A tibble: 6 × 3
## Id average_heartrate_per_minute heartrate_category
## <dbl> <dbl> <chr>
## 1 2022484408 80.2 " Normal Heartrate"
## 2 2026352035 93.8 " Normal Heartrate"
## 3 2347167796 76.7 " Normal Heartrate"
## 4 4020332650 82.3 " Normal Heartrate"
## 5 4388161847 66.1 " Normal Heartrate"
## 6 4558609924 81.7 " Normal Heartrate"
The heart rates of all 14 participants fall within the normal range.
We will also take a look at the data on weights of each of the participants
weight_data <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(weight_data)
## # A tibble: 6 × 8
## Id Date WeightKg Weight…¹ Fat BMI IsMan…² LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 160. 25 27.5 TRUE 1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport
Checking the number participants and data size
n_distinct(weight_data$Id)
## [1] 8
nrow(weight_data)
## [1] 67
The number of participants in the dataset is too low for us to make any meaningful inferences from the result of our analysis of this data.
A successful marketing campaign relies on matching the products you have to offer with the needs of prospective customers. The key takeaways above reveal the classes of individuals who can take advantage of the products Bellabeat have to offer to improve their health and make better health choices. I therefore recommend that:
The key takeaways should be woven into the marketing campaign as a way of demonstrating to customers the practical utility of Bellabeat products.
Given that these products are marketed primarily to women and they are expected to be worn for a long time, aesthetics and comfort should be seriously considered in the design of the products.