Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products for women. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. The founder believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.
This project focuses on one of Bellabeat’s product, Bellabeat app which provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Analyzing smart device usage data in order to identify trends based on how consumer uses non-Bellabeat smart device to get insight which ultimately help to draw a high-level recommendation for how these trends can inform Bellabeat marketing strategy.
The data set used for this case study is FitBit Fitness Tracker Data: Public Domain CCO data set made available through Mobius.
This data set may have its own limitation based on the following observation:-
It is not reliable since it contains small sample size; which may lead to sample bias and discrimination.
It is not original; it is a third-party data collected through distributed survey and made available for public use.
To say the data is comprehensive, it misses out some important detail about the nature of the sample size; it doesn’t specify the sampling method used to show how much of the population represented by the provided sample size, it doesn’t show any demographic detail like ages of the sample which could be helpful for gaining some useful insight and trends.
The data is collected in 2016 and last updated 2 years ago that means it is not current.
As the stakeholder also mentioned that the data has some limitation, this case study mainly focuses on finding initial insight by analyzing the available data to see if there is any trends in smart device usage. To get more reliable and high-level recommendation to inform the marketing strategy and business decision of the company, analyzing an up-to-date and credential data set is crucial.
For this case study
For this case study, the daily data and WeightLogInfo data will be used.
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("bigrquery")
#install.packages("here")
#install.packages("skimr")
#install.packages("plotrix")
#install.packages("anytime")
library("tidyverse")
library("dplyr")
library("bigrquery")
library(lubridate)
library(readxl)
library(here)
library(ggplot2)
library(plotrix)
library(magrittr)
library(anytime)
con <- dbConnect(
bigrquery::bigquery(),
project = "bellabeat-case-study-374417",
dataset = "bellabeat_dataset")
dbListTables(con)
skeds <-dplyr::tbl(con, "daily_activity_merged" )
SELECT
*
FROM(
SELECT COUNT(DISTINCT Id) AS daily_activity_Id FROM `bellabeat_dataset.daily_activity_merged`)
CROSS JOIN
(SELECT COUNT (DISTINCT Id) AS daily_calories_Id FROM `bellabeat_dataset.dailyCalories_merged`)
CROSS JOIN
(SELECT COUNT(DISTINCT Id) AS daily_intensities_Id FROM `bellabeat_dataset.dailyIntensities_merged`)
CROSS JOIN
(SELECT COUNT(DISTINCT Id) AS daily_steps_Id FROM `bellabeat_dataset.dailySteps_merged`)
CROSS JOIN
(SELECT COUNT(DISTINCT Id) AS daily_sleep_Id FROM `bellabeat_dataset.sleepDay_merged`)
CROSS JOIN
(SELECT COUNT(DISTINCT Id) AS weight_Log_Info FROM `bellabeat_dataset.weightLogInfo_merged`)
| daily_activity_Id | daily_calories_Id | daily_intensities_Id | daily_steps_Id | daily_sleep_Id | weight_Log_Info |
|---|---|---|---|---|---|
| 33 | 33 | 33 | 33 | 25 | 9 |
The data set lacks consistency in-terms of the number of users, it could be because of that users prefer to use features of their interest in the app.
-- checking which column name used across the table
SELECT
column_name,
COUNT(table_name) AS shared_name
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
GROUP BY 1
| column_name | shared_name |
|---|---|
| Id | 8 |
| ActivityDay | 3 |
| StepTotal | 2 |
| Date | 2 |
| WeightKg | 2 |
| WeightPounds | 2 |
| Fat | 2 |
| BMI | 2 |
| IsManualReport | 2 |
| LogId | 2 |
As we see from the result the column name “Id” is shared more than the other column names. To see if Id is in every table, let us run the following code:-
SELECT
table_name,
SUM(CASE
WHEN column_name = "Id" THEN 1
ELSE
0
END
) AS has_id_column
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
GROUP BY
1
ORDER BY
1 ASC
| table_name | has_id_column |
|---|---|
| dailyCalories_merged | 1 |
| dailyIntensities_merged | 1 |
| dailySteps_merged | 1 |
| daily_activity_merged | 1 |
| merged_table | 1 |
| sleepDay_merged | 1 |
| weightLogInfo_merged | 1 |
| weight_log_info | 1 |
The following query checks to make sure that each table has date or time related column.
-- Checking date or time related column,
-- If column types were detected properly prior to upload this table should be empty
SELECT
table_name,
SUM(CASE
WHEN data_type IN ("TIMESTAMP", "DATETIME", "TIME", "DATE") THEN 1
ELSE
0
END
) AS has_time_info
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
data_type IN ("TIMESTAMP",
"DATETIME",
"DATE")
GROUP BY
1
HAVING
has_time_info = 0
| table_name | has_time_info |
|---|
Checking the name of DATETIME, TIMESTAMP, or DATE type columns
SELECT
CONCAT(table_catalog,".",table_schema,".",table_name) AS table_path,
table_name,
column_name
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
data_type IN ("TIMESTAMP",
"DATETIME",
"DATE")
| table_path | table_name | column_name |
|---|---|---|
| bellabeat-case-study-374417.bellabeat_dataset.dailySteps_merged | dailySteps_merged | ActivityDay |
| bellabeat-case-study-374417.bellabeat_dataset.weight_log_info | weight_log_info | Date |
| bellabeat-case-study-374417.bellabeat_dataset.daily_activity_merged | daily_activity_merged | ActivityDate |
| bellabeat-case-study-374417.bellabeat_dataset.dailyIntensities_merged | dailyIntensities_merged | ActivityDay |
| bellabeat-case-study-374417.bellabeat_dataset.dailyCalories_merged | dailyCalories_merged | ActivityDay |
| bellabeat-case-study-374417.bellabeat_dataset.merged_table | merged_table | ActivityDate |
If we find that not every table has a DATETIME, TIMESTAMP, or DATE column we use their names to check date-related keyword.
-- Here we check to see if the column name has any of the following keywords:
-- date, minute, daily, hourly, day, seconds
SELECT
table_name,
column_name
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
REGEXP_CONTAINS(LOWER(column_name), "date|minute|daily|hourly|day|seconds");
| table_name | column_name |
|---|---|
| dailySteps_merged | ActivityDay |
| weight_log_info | Date |
| daily_activity_merged | ActivityDate |
| daily_activity_merged | VeryActiveMinutes |
| daily_activity_merged | FairlyActiveMinutes |
| daily_activity_merged | LightlyActiveMinutes |
| daily_activity_merged | SedentaryMinutes |
| weightLogInfo_merged | Date |
| dailyIntensities_merged | ActivityDay |
| dailyIntensities_merged | SedentaryMinutes |
Using regular expression to see if ‘ActivityDate’ column in the ‘Daily_activity_merged’ table follows a particular pattern.
-- To quickly check if all columns follow the time stamp pattern we can take the minimum value of the Boolean expression across the entire table
SELECT
CASE
WHEN MIN(REGEXP_CONTAINS(STRING(ActivityDate), 'TIMESTAMP_REGEX')) = TRUE THEN "Valid"
ELSE
"Not Valid"
END
AS valid_test
FROM
`bellabeat_dataset.daily_activity_merged`;
| valid_test |
|---|
| Not Valid |
As this case study focuses on the day-level data, we need to find out which table contains daily data :-
SELECT
DISTINCT(table_name)
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
REGEXP_CONTAINS(LOWER(table_name),"day|daily");
| table_name |
|---|
| dailySteps_merged |
| daily_activity_merged |
| dailyIntensities_merged |
| dailyCalories_merged |
| sleepDay_merged |
Now that we have a list of tables we should look at the columns that are shared among the tables
SELECT
column_name,
data_type,
COUNT(table_name) AS table_count
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
REGEXP_CONTAINS(LOWER(table_name),"day|daily")
GROUP BY
1,
2;
| column_name | data_type | table_count |
|---|---|---|
| Id | INT64 | 4 |
| ActivityDay | DATE | 3 |
| StepTotal | INT64 | 1 |
| ActivityDate | DATE | 1 |
| TotalSteps | INT64 | 1 |
| TotalDistance | FLOAT64 | 1 |
| TrackerDistance | FLOAT64 | 1 |
| LoggedActivitiesDistance | FLOAT64 | 1 |
| VeryActiveDistance | FLOAT64 | 2 |
| ModeratelyActiveDistance | FLOAT64 | 2 |
Next, joining four table by making sure that the data type align between tables:- * Daily activity merged * Daily Calories merged * Daily intensities merged * Daily steps merged
--looking at the columns that are shared among the table
SELECT
column_name,
table_name,
data_type
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
REGEXP_CONTAINS(LOWER(table_name),"day|daily")
AND column_name IN (
SELECT
column_name
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
REGEXP_CONTAINS(LOWER(table_name),"day|daily")
GROUP BY
1
HAVING
COUNT(table_name) >=2)
ORDER BY
1;
SELECT
A.Id,
A.Calories,
* EXCEPT(Id,
Calories,
ActivityDay,
SedentaryMinutes,
LightlyActiveMinutes,
FairlyActiveMinutes,
VeryActiveMinutes,
SedentaryActiveDistance,
LightActiveDistance,
ModeratelyActiveDistance,
VeryActiveDistance),
I.SedentaryMinutes,
I.LightlyActiveMinutes,
I.FairlyActiveMinutes,
I.VeryActiveMinutes,
I.SedentaryActiveDistance,
I.LightActiveDistance,
I.ModeratelyActiveDistance,
I.VeryActiveDistance
FROM
`bellabeat_dataset.daily_activity_merged` A
LEFT JOIN
`bellabeat_dataset.dailyCalories_merged` C
ON
A.Id = C.Id
AND A.ActivityDate=C.ActivityDay
AND A.Calories = C.Calories
LEFT JOIN
`bellabeat_dataset.dailyIntensities_merged` I
ON
A.Id = I.Id
AND A.ActivityDate=I.ActivityDay
AND A.FairlyActiveMinutes = I.FairlyActiveMinutes
AND A.LightActiveDistance = I.LightActiveDistance
AND A.LightlyActiveMinutes = I.LightlyActiveMinutes
AND A.ModeratelyActiveDistance = I.ModeratelyActiveDistance
AND A.SedentaryActiveDistance = I.SedentaryActiveDistance
AND A.SedentaryMinutes = I.SedentaryMinutes
AND A.VeryActiveDistance = I.VeryActiveDistance
AND A.VeryActiveMinutes = I.VeryActiveMinutes
LEFT JOIN
`bellabeat_dataset.dailySteps_merged` S
ON
A.Id = S.Id
AND A.ActivityDate=S.ActivityDay
| Id | Calories | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | StepTotal | SedentaryMinutes | LightlyActiveMinutes | FairlyActiveMinutes | VeryActiveMinutes | SedentaryActiveDistance | LightActiveDistance | ModeratelyActiveDistance | VeryActiveDistance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1624580081 | 2690 | 2016-05-01 | 36019 | 28.030001 | 28.030001 | 0 | 36019 | 1020 | 171 | 63 | 186 | 0.02 | 1.91 | 4.19 | 21.92 |
| 1644430081 | 3226 | 2016-04-14 | 11037 | 8.020001 | 8.020001 | 0 | 11037 | 1125 | 252 | 58 | 5 | 0.00 | 5.10 | 2.56 | 0.36 |
| 1644430081 | 3300 | 2016-04-19 | 11256 | 8.180000 | 8.180000 | 0 | 11256 | 1099 | 278 | 58 | 5 | 0.00 | 5.30 | 2.53 | 0.36 |
| 1644430081 | 3108 | 2016-04-28 | 9405 | 6.840000 | 6.840000 | 0 | 9405 | 1157 | 227 | 53 | 3 | 0.00 | 4.31 | 2.32 | 0.20 |
| 1644430081 | 3846 | 2016-04-30 | 18213 | 13.240000 | 13.240000 | 0 | 18213 | 816 | 402 | 71 | 9 | 0.00 | 9.46 | 3.14 | 0.63 |
| 1644430081 | 3324 | 2016-05-03 | 12850 | 9.340000 | 9.340000 | 0 | 12850 | 1115 | 221 | 94 | 10 | 0.00 | 4.54 | 4.09 | 0.72 |
| 2022484408 | 2897 | 2016-04-20 | 15112 | 10.670000 | 10.670000 | 0 | 15112 | 1053 | 276 | 63 | 48 | 0.00 | 5.40 | 1.93 | 3.34 |
| 2022484408 | 2709 | 2016-05-09 | 13379 | 9.390000 | 9.390000 | 0 | 13379 | 1061 | 297 | 47 | 35 | 0.00 | 5.64 | 1.63 | 2.12 |
| NA | 2010 | 2016-04-14 | 10129 | 6.700000 | 6.700000 | 0 | 10129 | 705 | 206 | 48 | 1 | 0.00 | 3.94 | 2.74 | 0.02 |
| NA | 2670 | 2016-04-16 | 22244 | 15.080000 | 15.080000 | 0 | 22244 | 968 | 268 | 72 | 66 | 0.00 | 5.53 | 4.10 | 5.45 |
merged_data <- read_xlsx(here("C:/Users/selam/Desktop/Case Study_Bellabeat/Merged_data.xlsx"))
head(merged_data)
## # A tibble: 6 × 16
## Id Activity_Date Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## 2 8877689391 2016-05-09 00:00:00 3710 20226 18.2 18.2 0 20226
## 3 6962181067 2016-04-21 00:00:00 2179 11835 9.71 7.88 4.08 11835
## 4 6962181067 2016-04-25 00:00:00 2194 13239 9.27 9.08 2.79 13239
## 5 4020332650 2016-05-05 00:00:00 3429 11728 8.43 8.43 0 11728
## 6 1644430081 2016-04-30 00:00:00 3846 18213 13.2 13.2 0 18213
## # … with 8 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## # SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>, and abbreviated
## # variable names ¹Calories, ²TotalSteps, ³TotalDistance, ⁴TrackerDistance,
## # ⁵LoggedActivitiesDistance, ⁶StepTotal
sum(duplicated(merged_data))
## [1] 0
merged_data <- merged_data%>%
distinct() %>%
drop_na()
class(merged_data$Activity_Date)
## [1] "POSIXct" "POSIXt"
SleepDay_new <- read_csv(here("C:/Users/selam/Desktop/Case Study_Bellabeat/SleepDay_mergedNew.csv"))
head(SleepDay_new)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
sum(duplicated(SleepDay_new))
## [1] 3
SleepDay_new <- SleepDay_new %>%
distinct() %>%
drop_na()
class(SleepDay_new$SleepDay)
## [1] "Date"
WeightLogInfo_new <- read_csv("C:/Users/selam/Desktop/Case Study_Bellabeat/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
head(WeightLogInfo_new)
## # A tibble: 6 × 8
## Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 2016-05-02 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 2016-05-03 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 2016-04-13 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 2016-04-21 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 2016-05-12 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 2016-04-17 72.4 160. 25 27.5 TRUE 1.46e12
sum(duplicated(WeightLogInfo_new))
## [1] 0
class(WeightLogInfo_new$Date)
## [1] "Date"
Now that we have merged and cleaned data set, let’s join it all together for the analysis.
## First let's join daily activity merge with sleep day table
DailyActivity_Sleep <- merged_data %>%
full_join(SleepDay_new, by = c("Id", "Activity_Date" = "SleepDay"))
head(DailyActivity_Sleep)
## # A tibble: 6 × 19
## Id Activity_Date Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## 2 8877689391 2016-05-09 00:00:00 3710 20226 18.2 18.2 0 20226
## 3 6962181067 2016-04-21 00:00:00 2179 11835 9.71 7.88 4.08 11835
## 4 6962181067 2016-04-25 00:00:00 2194 13239 9.27 9.08 2.79 13239
## 5 4020332650 2016-05-05 00:00:00 3429 11728 8.43 8.43 0 11728
## 6 1644430081 2016-04-30 00:00:00 3846 18213 13.2 13.2 0 18213
## # … with 11 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## # SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>,
## # TotalSleepRecords <dbl>, TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>,
## # and abbreviated variable names ¹Calories, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶StepTotal
## Now we can join daily activity sleep day data with Weight Log data
Joined_DailyActi_sleep_weight <- DailyActivity_Sleep %>%
full_join(WeightLogInfo_new, by = "Id")
glimpse(Joined_DailyActi_sleep_weight)
According to the metadata of FitBit dataset
We don’t have the demographic information of age group of the sample so let’s categorize our samples based on the total steps taken by users.
The goal of 10,000 steps is the recommended daily step target for healthy adults to achieve health benefits according to 10000steps.
Guideline on steps and activity levels:-
To categorize our sample, first we need to calculate the average calories, steps, sleep and weight by using the mean() function.
To do that, we can use merged daily activity, Sleep day and Weight Log data set.
## Average of Daily activity, step, sleep and weight
Ave <- Joined_DailyActi_sleep_weight %>%
group_by(Id)%>%
summarize(mean_cal = mean(Calories), mean_steps = mean(TotalSteps), mean_sleep = mean(TotalMinutesAsleep), mean_weight = mean(WeightPounds))
head(Ave)
## # A tibble: 6 × 5
## Id mean_cal mean_steps mean_sleep mean_weight
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 1816. 12117. NA 116.
## 2 1624580081 1483. 5744. NA NA
## 3 1644430081 2811. 7283. NA NA
## 4 1844505072 1573. 2580. NA NA
## 5 1927972279 2173. 916. NA 294.
## 6 2022484408 2510. 11371. NA NA
Now that we have aggregated data, we can categorize user types according to the above guideline.
## User Type👤
user_type <- Ave %>%
mutate(user_type = case_when
(mean_steps < 5000 ~ "Sedentary", mean_steps >= 5000 & mean_steps < 7499 ~ "Lightly Active", mean_steps >= 7500 & mean_steps < 9999 ~ "Fairly Active", mean_steps >=10000 ~ "Very Active"))
head(user_type)
## # A tibble: 6 × 6
## Id mean_cal mean_steps mean_sleep mean_weight user_type
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1503960366 1816. 12117. NA 116. Very Active
## 2 1624580081 1483. 5744. NA NA Lightly Active
## 3 1644430081 2811. 7283. NA NA Lightly Active
## 4 1844505072 1573. 2580. NA NA Sedentary
## 5 1927972279 2173. 916. NA 294. Sedentary
## 6 2022484408 2510. 11371. NA NA Very Active
→ Next we can create a data frame by calculating the percentage of each user type so that we can visualize the distribution among users.
Usertype_percentile <- user_type %>%
group_by(user_type) %>%
summarise(total = n()) %>%
mutate(totals =sum(total)) %>%
group_by(user_type) %>%
summarise(labels = scales::percent(total/totals))
Usertype_percentile$user_type <- factor(Usertype_percentile$user_type , labels = c("very active", "fairly active", "lightly active", "sedentary"))
head(Usertype_percentile)
## # A tibble: 4 × 2
## user_type labels
## <fct> <chr>
## 1 very active 27%
## 2 fairly active 27%
## 3 lightly active 24%
## 4 sedentary 21%
We can now visualize user type percentile.
Usertype_percentile %>%
ggplot(aes(x="",y=labels, fill=user_type)) +
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0)+
theme_minimal()+
theme(axis.title.x= element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
scale_fill_manual(values = c("green","yellow", "orange", "red")) +
geom_text(aes(label = labels),
position = position_stack(vjust = 0.5))+
labs(title="User type percentiles")
→ User’s Sleeping behavior
We can check if there is any trends in user’s sleeping behavior based on the total steps taken by different types of users
Hypothesis: users who take more steps tends to sleep the desired amount of hour per day.
→ Let’s prove if the above hypothesis is true.
## User Type_2👤
user_type_2 <- Joined_DailyActi_sleep_weight %>%
mutate(user_type = case_when
(TotalSteps < 5000 ~ "Sedentary", TotalSteps >= 5000 & TotalSteps < 7499 ~ "Lightly Active", TotalSteps >= 7500 & TotalSteps < 9999 ~ "Fairly Active", TotalSteps >=10000 ~ "Very Active"))
head(user_type_2)
## # A tibble: 6 × 27
## Id Activity_Date Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## 2 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## 3 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## 4 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## 5 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## 6 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## # … with 19 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## # SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>,
## # TotalSleepRecords <dbl>, TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>,
## # Date <date>, WeightKg <dbl>, WeightPounds <dbl>, Fat <dbl>, BMI <dbl>,
## # IsManualReport <lgl>, LogId <dbl>, user_type <chr>, and abbreviated …
ggplot(data = user_type_2, aes(x = TotalSteps, y = TotalMinutesAsleep))+
geom_point(aes(color = user_type, shape = user_type))+
facet_wrap(~user_type)+
labs(title = "User's sleeping behavior", element_text(hjust = 0.5))
As we can see from the above graph there is no significant change shown in user’s sleeping behavior among different types of users and steps taken.
Before we get into the analysis we know that only 9 users had data in the WeightLogInfo table so that the result from the analysis may not be accurate due to limited number of samples.
ggplot(data = user_type)+
geom_point(mapping = aes(x = mean_steps, y = mean_weight, color = user_type, shape = user_type))+
facet_wrap(~user_type)+
labs(title = "Steps and Weight", element_text(hjust = 0.5))
It can be inaccurate to say that users total steps and weight has a positive correlation, but as we can see from the above plot, we find the lowest average weight is in very active user categories and the highest is in Sedentary user categories.
Now let’s use Tableau visualization to see if there is any relationship between calorie burned, total steps and physical intensity.
From the above visualization:-
→ The scatter plot shows that there is a positive Correlation between calories and total steps.
→ The Correlation between physical intensity and calories is not perfectly positive; the variables don’t move together by the exact same percent and direction.
→ Users doesn’t usual logged into the app during their activities as it shows that the table contains almost no data in the logged activity distance column.
Let’s create groups of users based on the number of days that our samples used their devices through out the one month intervals.
→ High: users those used their device between 21 and 31 days, → Medium: users who uses their device for 11 to 20 days and, → Low: those are used their devises between 0 to 10 days.
Now we can create a data frame called ‘device usage’ based on the above categories:
device_usage <- DailyActivity_Sleep %>%
group_by(Id) %>%
summarise(numofdays_used = sum(n())) %>%
mutate(dev_usage = case_when(
numofdays_used >= 1 & numofdays_used <=10 ~ 'Low',
numofdays_used >= 11 & numofdays_used <= 20 ~ 'Medium',
numofdays_used >= 21 & numofdays_used <= 31 ~ 'High')) %>%
ungroup()
head(device_usage)
## # A tibble: 6 × 3
## Id numofdays_used dev_usage
## <dbl> <int> <chr>
## 1 1503960366 31 High
## 2 1624580081 31 High
## 3 1644430081 30 High
## 4 1844505072 31 High
## 5 1927972279 31 High
## 6 2022484408 31 High
→ Calculating device usage percentage
dev.usage_percent <- device_usage %>%
group_by(dev_usage) %>%
summarise(total = sum(n())) %>%
mutate(percentage = scales::percent(total/sum(total)))
head(dev.usage_percent)
## # A tibble: 3 × 3
## dev_usage total percentage
## <chr> <int> <chr>
## 1 High 29 87.9%
## 2 Low 1 3.0%
## 3 Medium 3 9.1%
pie3D(dev.usage_percent[[2]], labels = dev.usage_percent[[3]], main = "Device Usage Percentile", col = rainbow(3),radius = 1)
legend("topright", dev.usage_percent[[1]], cex = 0.5, fill = rainbow(length(dev.usage_percent[[1]])))
As we can see from the above analysis
→ 87.9% of users use their devices more often between 21 days and 31 days,
→ 3% of sample users use their device between 0 and 10 days which is low usage and,
→ 9.1% of users use their devices between 11 days and 21 days.
Now we can calculate the total minute that the device worn by users throughout the day by using the daily activity and intensity data and also to check if there is any trends we use device usage data frame.
First let’s merge daily activity table with device usage data frame.
daily_usage <- merged_data %>%
full_join(device_usage, by = c("Id"))
head(daily_usage)
## # A tibble: 6 × 18
## Id Activity_Date Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## 2 8877689391 2016-05-09 00:00:00 3710 20226 18.2 18.2 0 20226
## 3 6962181067 2016-04-21 00:00:00 2179 11835 9.71 7.88 4.08 11835
## 4 6962181067 2016-04-25 00:00:00 2194 13239 9.27 9.08 2.79 13239
## 5 4020332650 2016-05-05 00:00:00 3429 11728 8.43 8.43 0 11728
## 6 1644430081 2016-04-30 00:00:00 3846 18213 13.2 13.2 0 18213
## # … with 10 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## # SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>,
## # numofdays_used <int>, dev_usage <chr>, and abbreviated variable names
## # ¹Calories, ²TotalSteps, ³TotalDistance, ⁴TrackerDistance,
## # ⁵LoggedActivitiesDistance, ⁶StepTotal
→ Now we can categorize users based on the total number of minute that users have their smart device on daily.
Full day
Half day and more
Less than half day
Let’s calculate total minute of daily smart device usage and percentage to assign into the above categories. For that we can use all the active minutes column and sedentary minute from the daily active table.
total_minu_usage <- daily_usage %>%
mutate(total_minute_wore = SedentaryMinutes + LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes) %>%
mutate(percent_minu_wore = 100*(total_minute_wore/1440)) %>%
mutate(categories = case_when(
percent_minu_wore == 100 ~ "Full day", percent_minu_wore < 100 & percent_minu_wore > 50 ~ "Half day & more", percent_minu_wore < 50 ~ "Less than half day"
))
head(total_minu_usage)
## # A tibble: 6 × 21
## Id Activity_Date Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8877689391 2016-04-22 00:00:00 3427 18258 16.3 16.3 0 18258
## 2 8877689391 2016-05-09 00:00:00 3710 20226 18.2 18.2 0 20226
## 3 6962181067 2016-04-21 00:00:00 2179 11835 9.71 7.88 4.08 11835
## 4 6962181067 2016-04-25 00:00:00 2194 13239 9.27 9.08 2.79 13239
## 5 4020332650 2016-05-05 00:00:00 3429 11728 8.43 8.43 0 11728
## 6 1644430081 2016-04-30 00:00:00 3846 18213 13.2 13.2 0 18213
## # … with 13 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## # SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>,
## # numofdays_used <int>, dev_usage <chr>, total_minute_wore <dbl>,
## # percent_minu_wore <dbl>, categories <chr>, and abbreviated variable names
## # ¹Calories, ²TotalSteps, ³TotalDistance, ⁴TrackerDistance, …
Let’s have the percentage of the above three categories.
percent_total_minu <- total_minu_usage %>%
group_by(categories) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(categories) %>%
summarise(total_minu_percent = total / totals) %>%
mutate(labels = scales::percent(total_minu_percent))
head(percent_total_minu)
## # A tibble: 3 × 3
## categories total_minu_percent labels
## <chr> <dbl> <chr>
## 1 Full day 0.509 50.9%
## 2 Half day & more 0.465 46.5%
## 3 Less than half day 0.0266 2.7%
pie3D(percent_total_minu[[2]], labels = percent_total_minu[[3]], main = "Daily Smart Device Usage", explode = 0.1, radius = 1, col= rainbow(length(percent_total_minu)))
legend("topright", percent_total_minu[[1]], cex = 0.5, fill = rainbow(length(percent_total_minu[[1]])))
50.9% of users are the one who wore their device for full day,
46.5% of users wore their smart device for more than half day and,
2.7% of users wore their device less than a half day which is very few number of users.
We can now check if there is any trends in the number of days that the device used on monthly base and the total minute that the device worn on a daily base.
For this we use
→ Devise Usage:-
→ User categories based on the number of minute that they wore their smart device:-
ggplot(data = total_minu_usage)+
geom_bar(mapping = aes(x=categories, fill = categories ))+
facet_wrap(~dev_usage)+
labs(text = "Device Usage and Total Minute Worn")+
theme(axis.text.x = element_text(angle = 45))
As we can see from the above graph :-
The majority of users who wore their device for full day are under the category of High device users,
In the same way almost all of half day users also fall into the categories oh high device user’s categories
we can also say that medium and low device users wore their smart device less time throughout the day than high device users.
As we remember
High users are the one who uses their device between 21 days and 31 days,
Medium Users are users who uses their device between 11 to 20 days and,
Low users who uses their device less than 10 days in a month.
As a reminder the objective of this analysis is to identify trends on how consumer use non-Bellabeat devices by analyzing smart device usage to ultimately draw a high-level recommendation which will inform Bellabeat marketing strategy.
To achieve a more reliable and useful high-level recommendations analyzing Bellabeat’s own data is important. As the sample size, age group and reliability of FitBit data is in question, further analysis required using additional information and resources.
Based on my analysis I concluded the following;
→ Only 21% of users are in “Very Active” user category who takes the recommended amount of daily steps.
→ Users rarely logged into their App while they are active.
→ Among different variables in the datasets, only total steps taken and calories have a positive correlation.
→ Most of users not fully utilize all the features in the App which can benefit them in many ways for example WeightLog features.
→ 87.9% of users use their device more often during one month interval and 50.9% of users wore their smart device the whole time.
Based on the above conclusion the following recommendation drawn to help in planing Bellabeat’s future marketing strategies:
1. Engaging Users
For users to be more interactive with the Bellabeat App, we need to include more features which benefit and attract users; to name a few:-
→ A matching features for users to find a walking partners that way users can also encouraged to walk their recommended daily steps and
→ Different types of club for users to join in the App with other Bellabeat users.
Bellabeat App is first Woman’s health App with private key encryption which gives a confidence for users that their personal data is private.
In addition since the focus of Bellabeat is giving women the tools to live in harmony with themselves, it is important to continually promote all the useful features in the App and all products to help women reach their wellness goals.
2. Benefit And Reward
To encourage customers achieve their personal goal we can have a reward system that is when users achieve to a specified goal in a certain time period they could collect points, which ultimately can be redeemable in a form of discount, coupons, or Bellabeat product give away depending on The Wellness Score features of the App which shows step, activity, sleep, meditation & hydration progress throughout the day & night
3. Notification and Awareness
To get users more exited on receiving notification and useful awareness tips without making them annoyed and overwhelmed, we need to have a smart way to send a more personalized notifications, tips and encouragement on a daily base.
Ways that a Wellness Technology Company Play It Smart