1. Introduction

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products for women. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women. By 2016, Bellabeat had opened offices around the world and launched multiple products, which became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market. The founders believe that analyzing smart device fitness data could help unlock new growth opportunities for the company.

Products
  • Bellabeat app
  • Leaf
  • Time
  • Spring
  • Bellabeat membership

This project focuses on one of Bellabeat's products, the Bellabeat app, which provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to the company's line of smart wellness products.

2. Ask

2.1. Business Task

Analyze smart device usage data to identify trends in how consumers use non-Bellabeat smart devices, and use these insights to draw high-level recommendations for how the trends can inform Bellabeat's marketing strategy.

  • What are some trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat's marketing strategy?

2.2. Stakeholders

  • Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team
  • Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

3. Prepare

3.1 Data set Used

The data set used for this case study is FitBit Fitness Tracker Data (CC0: Public Domain), made available through Mobius.

3.2 About the Data

  • The data set is organized in long format, where each subject has data in multiple rows. It is composed of quantitative data and contains 18 CSV files presented in spreadsheet form.
  • The data is hosted on Kaggle.
  • It contains personal fitness tracker data from 30 eligible FitBit users who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
  • The data set was collected through a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.
  • Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

3.3. Credibility and Integrity of the Data set

This data set has limitations based on the following observations:

  • It is not reliable: the sample size is small, which may lead to sampling bias.

  • It is not original: it is third-party data collected through a distributed survey and made available for public use.

  • It is not comprehensive: it omits important details about the nature of the sample, such as the sampling method used (so we cannot tell how well the sample represents the wider population) and demographic information such as age, which would be helpful for drawing useful insights and trends.

  • The data was collected in 2016 and was last updated two years ago, which means it is not current.

As the stakeholders also acknowledged, the data has limitations, so this case study mainly focuses on finding initial insights by analyzing the available data to see whether there are any trends in smart device usage. To produce more reliable, high-level recommendations that can inform the company's marketing strategy and business decisions, analyzing an up-to-date and credible data set is crucial.

4. Process

4.1 Tools

The following tools were used for this case study:

  • Excel and SQL (BigQuery): used for cleaning, aggregating, organizing, and analyzing the data set.
  • R: used for further analysis and visualization.
  • Tableau: used for data visualization.

For this case study, the daily data and WeightLogInfo data will be used.
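For reference, here is a minimal sketch of how the day-level and weight files could be read directly into R. The folder and file names are assumed to follow the standard Fitabase 4.12.16-5.12.16 export from the Kaggle data set; adjust the path to the local download location.

library(readr)

# Path to the downloaded Kaggle export (adjust as needed)
data_dir <- "Fitabase Data 4.12.16-5.12.16"

# Day-level and weight tables used in this case study
daily_activity <- read_csv(file.path(data_dir, "dailyActivity_merged.csv"))
sleep_day      <- read_csv(file.path(data_dir, "sleepDay_merged.csv"))
weight_log     <- read_csv(file.path(data_dir, "weightLogInfo_merged.csv"))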

4.2. Data Cleaning

4.2.1. Installing Packages

# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("bigrquery")
#install.packages("here")
#install.packages("skimr")
#install.packages("plotrix")
#install.packages("anytime")
library("tidyverse")
library("dplyr")
library("bigrquery")
library(lubridate)
library(readxl)
library(here)
library(ggplot2)
library(plotrix)
library(magrittr)
library(anytime)
con <- dbConnect(
  bigrquery::bigquery(),
  project = "bellabeat-case-study-374417",
  dataset = "bellabeat_dataset")

dbListTables(con)

skeds <-dplyr::tbl(con, "daily_activity_merged" )
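Before writing SQL, the skeds reference created above can also be queried lazily with dplyr; a minimal sketch that counts the distinct user Ids in the daily activity table (dbplyr translates this to a COUNT(DISTINCT Id) query and collect() pulls the one-row result into R):

skeds %>%
  summarise(daily_activity_Id = n_distinct(Id)) %>%
  collect()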
Counting the distinct user Ids in each table using a SQL query:
SELECT
*
FROM(
  SELECT COUNT(DISTINCT Id) AS daily_activity_Id FROM `bellabeat_dataset.daily_activity_merged`)

CROSS JOIN
(SELECT COUNT (DISTINCT Id) AS daily_calories_Id FROM `bellabeat_dataset.dailyCalories_merged`)
CROSS JOIN
(SELECT COUNT(DISTINCT Id) AS daily_intensities_Id FROM `bellabeat_dataset.dailyIntensities_merged`)
CROSS JOIN
(SELECT COUNT(DISTINCT Id) AS daily_steps_Id FROM `bellabeat_dataset.dailySteps_merged`)
CROSS JOIN
(SELECT COUNT(DISTINCT Id) AS daily_sleep_Id FROM `bellabeat_dataset.sleepDay_merged`)
CROSS JOIN
(SELECT COUNT(DISTINCT Id) AS weight_Log_Info FROM `bellabeat_dataset.weightLogInfo_merged`)
1 records
daily_activity_Id daily_calories_Id daily_intensities_Id daily_steps_Id daily_sleep_Id weight_Log_Info
33 33 33 33 25 9

The data set lacks consistency in terms of the number of users per table; this could be because users prefer to use only the features of the app that interest them.
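To see which users are behind the gap, here is a sketch of the same idea with dplyr on the BigQuery connection: list the Ids that appear in the daily activity table but never in the sleep table (table names as used above).

# Lazy reference to the sleep table
sleep_tbl <- dplyr::tbl(con, "sleepDay_merged")

# Ids with activity data but no sleep data
skeds %>%
  distinct(Id) %>%
  anti_join(sleep_tbl %>% distinct(Id), by = "Id") %>%
  collect()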

4.2.2. Exploring the data set

-- checking which column name used across the table
SELECT
column_name,
 COUNT(table_name) AS shared_name
FROM
 `bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
 GROUP BY 1
Displaying records 1 - 10
column_name shared_name
Id 8
ActivityDay 3
StepTotal 2
Date 2
WeightKg 2
WeightPounds 2
Fat 2
BMI 2
IsManualReport 2
LogId 2

As we can see from the result, the column name "Id" is shared across more tables than any other column. To check whether Id appears in every table, let us run the following query:

SELECT
 table_name,
 SUM(CASE
     WHEN column_name = "Id" THEN 1
   ELSE
   0
 END
   ) AS has_id_column
FROM
 `bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
GROUP BY
 1
ORDER BY
 1 ASC
8 records
table_name has_id_column
dailyCalories_merged 1
dailyIntensities_merged 1
dailySteps_merged 1
daily_activity_merged 1
merged_table 1
sleepDay_merged 1
weightLogInfo_merged 1
weight_log_info 1

The following query checks that each table has a date- or time-related column.

-- Checking date or time related column, 
-- If column types were detected properly prior to upload this table should be empty
SELECT
 table_name,
 SUM(CASE
     WHEN data_type IN ("TIMESTAMP", "DATETIME", "TIME", "DATE") THEN 1
   ELSE
   0
 END
   ) AS has_time_info
FROM
 `bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
 data_type IN ("TIMESTAMP",
   "DATETIME",
   "DATE")
GROUP BY
 1
HAVING
 has_time_info = 0
0 records
table_name has_time_info

Checking the names of the DATETIME, TIMESTAMP, or DATE type columns:

SELECT
 CONCAT(table_catalog,".",table_schema,".",table_name) AS table_path,
 table_name,
 column_name
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
 data_type IN ("TIMESTAMP",
   "DATETIME",
   "DATE")
6 records
table_path table_name column_name
bellabeat-case-study-374417.bellabeat_dataset.dailySteps_merged dailySteps_merged ActivityDay
bellabeat-case-study-374417.bellabeat_dataset.weight_log_info weight_log_info Date
bellabeat-case-study-374417.bellabeat_dataset.daily_activity_merged daily_activity_merged ActivityDate
bellabeat-case-study-374417.bellabeat_dataset.dailyIntensities_merged dailyIntensities_merged ActivityDay
bellabeat-case-study-374417.bellabeat_dataset.dailyCalories_merged dailyCalories_merged ActivityDay
bellabeat-case-study-374417.bellabeat_dataset.merged_table merged_table ActivityDate

If we find that not every table has a DATETIME, TIMESTAMP, or DATE column, we can use the column names to check for date-related keywords.


 -- Here we check to see if the column name has any of the following keywords:
 -- date, minute, daily, hourly, day, seconds
SELECT
 table_name,
 column_name
FROM
 `bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
 REGEXP_CONTAINS(LOWER(column_name), "date|minute|daily|hourly|day|seconds");
Displaying records 1 - 10
table_name column_name
dailySteps_merged ActivityDay
weight_log_info Date
daily_activity_merged ActivityDate
daily_activity_merged VeryActiveMinutes
daily_activity_merged FairlyActiveMinutes
daily_activity_merged LightlyActiveMinutes
daily_activity_merged SedentaryMinutes
weightLogInfo_merged Date
dailyIntensities_merged ActivityDay
dailyIntensities_merged SedentaryMinutes

Using a regular expression to check whether the 'ActivityDate' column in the 'daily_activity_merged' table follows a particular timestamp pattern:

 -- To quickly check if all columns follow the timestamp pattern we can take the minimum value of the Boolean expression across the entire table
 -- NOTE: 'TIMESTAMP_REGEX' is a placeholder and should be replaced with an actual pattern,
 -- e.g. r'^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}', for this check to be meaningful
SELECT
 CASE
   WHEN MIN(REGEXP_CONTAINS(STRING(ActivityDate), 'TIMESTAMP_REGEX')) = TRUE THEN "Valid"
 ELSE
 "Not Valid"
END
 AS valid_test
FROM
 `bellabeat_dataset.daily_activity_merged`;
1 records
valid_test
Not Valid

As this case study focuses on day-level data, we need to find out which tables contain daily data:

SELECT
DISTINCT(table_name)
FROM
`bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
REGEXP_CONTAINS(LOWER(table_name),"day|daily");
5 records
table_name
dailySteps_merged
daily_activity_merged
dailyIntensities_merged
dailyCalories_merged
sleepDay_merged

Now that we have a list of daily tables, we should look at the columns that are shared among them.

SELECT
 column_name,
 data_type,
 COUNT(table_name) AS table_count
FROM
 `bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
 REGEXP_CONTAINS(LOWER(table_name),"day|daily")
GROUP BY
 1,
 2;
Displaying records 1 - 10
column_name data_type table_count
Id INT64 4
ActivityDay DATE 3
StepTotal INT64 1
ActivityDate DATE 1
TotalSteps INT64 1
TotalDistance FLOAT64 1
TrackerDistance FLOAT64 1
LoggedActivitiesDistance FLOAT64 1
VeryActiveDistance FLOAT64 2
ModeratelyActiveDistance FLOAT64 2

4.3. More cleaning, Merging and Importing data

Next, we join the following four tables, making sure that the data types align between them:

  • Daily activity merged
  • Daily calories merged
  • Daily intensities merged
  • Daily steps merged

--looking at the columns that are shared among the table

SELECT
 column_name,
 table_name,
 data_type
FROM
 `bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE
 REGEXP_CONTAINS(LOWER(table_name),"day|daily")
 AND column_name IN (
 SELECT
   column_name
 FROM
   `bellabeat_dataset.INFORMATION_SCHEMA.COLUMNS`
 WHERE
   REGEXP_CONTAINS(LOWER(table_name),"day|daily")
 GROUP BY
   1
 HAVING
   COUNT(table_name) >=2)
ORDER BY
 1;
SELECT
 A.Id,
 A.Calories,
 * EXCEPT(Id,
   Calories,
   ActivityDay,
   SedentaryMinutes,
   LightlyActiveMinutes,
   FairlyActiveMinutes,
   VeryActiveMinutes,
   SedentaryActiveDistance,
   LightActiveDistance,
   ModeratelyActiveDistance,
   VeryActiveDistance),
 I.SedentaryMinutes,
 I.LightlyActiveMinutes,
 I.FairlyActiveMinutes,
 I.VeryActiveMinutes,
 I.SedentaryActiveDistance,
 I.LightActiveDistance,
 I.ModeratelyActiveDistance,
 I.VeryActiveDistance
FROM
 `bellabeat_dataset.daily_activity_merged` A
LEFT JOIN
 `bellabeat_dataset.dailyCalories_merged` C
ON
 A.Id = C.Id
 AND A.ActivityDate=C.ActivityDay
 AND A.Calories = C.Calories
LEFT JOIN
 `bellabeat_dataset.dailyIntensities_merged` I
ON
 A.Id = I.Id
 AND A.ActivityDate=I.ActivityDay
 AND A.FairlyActiveMinutes = I.FairlyActiveMinutes
 AND A.LightActiveDistance = I.LightActiveDistance
 AND A.LightlyActiveMinutes = I.LightlyActiveMinutes
 AND A.ModeratelyActiveDistance = I.ModeratelyActiveDistance
 AND A.SedentaryActiveDistance = I.SedentaryActiveDistance
 AND A.SedentaryMinutes = I.SedentaryMinutes
 AND A.VeryActiveDistance = I.VeryActiveDistance
 AND A.VeryActiveMinutes = I.VeryActiveMinutes
LEFT JOIN
 `bellabeat_dataset.dailySteps_merged` S
ON
 A.Id = S.Id
 AND A.ActivityDate=S.ActivityDay
Displaying records 1 - 10
Id Calories ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance StepTotal SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
1624580081 2690 2016-05-01 36019 28.030001 28.030001 0 36019 1020 171 63 186 0.02 1.91 4.19 21.92
1644430081 3226 2016-04-14 11037 8.020001 8.020001 0 11037 1125 252 58 5 0.00 5.10 2.56 0.36
1644430081 3300 2016-04-19 11256 8.180000 8.180000 0 11256 1099 278 58 5 0.00 5.30 2.53 0.36
1644430081 3108 2016-04-28 9405 6.840000 6.840000 0 9405 1157 227 53 3 0.00 4.31 2.32 0.20
1644430081 3846 2016-04-30 18213 13.240000 13.240000 0 18213 816 402 71 9 0.00 9.46 3.14 0.63
1644430081 3324 2016-05-03 12850 9.340000 9.340000 0 12850 1115 221 94 10 0.00 4.54 4.09 0.72
2022484408 2897 2016-04-20 15112 10.670000 10.670000 0 15112 1053 276 63 48 0.00 5.40 1.93 3.34
2022484408 2709 2016-05-09 13379 9.390000 9.390000 0 13379 1061 297 47 35 0.00 5.64 1.63 2.12
NA 2010 2016-04-14 10129 6.700000 6.700000 0 10129 705 206 48 1 0.00 3.94 2.74 0.02
NA 2670 2016-04-16 22244 15.080000 15.080000 0 22244 968 268 72 66 0.00 5.53 4.10 5.45

4.3.1. Importing and cleaning datasets

# Import the merged day-level data (result of the SQL join above, saved as Excel)
merged_data <- read_xlsx(here("C:/Users/selam/Desktop/Case Study_Bellabeat/Merged_data.xlsx"))
head(merged_data)
## # A tibble: 6 × 16
##           Id Activity_Date       Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
##        <dbl> <dttm>                <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 8877689391 2016-04-22 00:00:00    3427   18258   16.3    16.3     0      18258
## 2 8877689391 2016-05-09 00:00:00    3710   20226   18.2    18.2     0      20226
## 3 6962181067 2016-04-21 00:00:00    2179   11835    9.71    7.88    4.08   11835
## 4 6962181067 2016-04-25 00:00:00    2194   13239    9.27    9.08    2.79   13239
## 5 4020332650 2016-05-05 00:00:00    3429   11728    8.43    8.43    0      11728
## 6 1644430081 2016-04-30 00:00:00    3846   18213   13.2    13.2     0      18213
## # … with 8 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## #   SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>, and abbreviated
## #   variable names ¹​Calories, ²​TotalSteps, ³​TotalDistance, ⁴​TrackerDistance,
## #   ⁵​LoggedActivitiesDistance, ⁶​StepTotal
sum(duplicated(merged_data))
## [1] 0
merged_data <- merged_data%>% 
 distinct() %>% 
  drop_na()
class(merged_data$Activity_Date)
## [1] "POSIXct" "POSIXt"
# Import the daily sleep data
SleepDay_new <- read_csv(here("C:/Users/selam/Desktop/Case Study_Bellabeat/SleepDay_mergedNew.csv"))
head(SleepDay_new)
## # A tibble: 6 × 5
##           Id SleepDay   TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <date>                 <dbl>              <dbl>          <dbl>
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
sum(duplicated(SleepDay_new))
## [1] 3
SleepDay_new <- SleepDay_new %>% 
 distinct() %>% 
  drop_na()
class(SleepDay_new$SleepDay)
## [1] "Date"
# Import the weight log data
WeightLogInfo_new <- read_csv("C:/Users/selam/Desktop/Case Study_Bellabeat/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
head(WeightLogInfo_new)
## # A tibble: 6 × 8
##           Id Date       WeightKg WeightPounds   Fat   BMI IsManualReport   LogId
##        <dbl> <date>        <dbl>        <dbl> <dbl> <dbl> <lgl>            <dbl>
## 1 1503960366 2016-05-02     52.6         116.    22  22.6 TRUE           1.46e12
## 2 1503960366 2016-05-03     52.6         116.    NA  22.6 TRUE           1.46e12
## 3 1927972279 2016-04-13    134.          294.    NA  47.5 FALSE          1.46e12
## 4 2873212765 2016-04-21     56.7         125.    NA  21.5 TRUE           1.46e12
## 5 2873212765 2016-05-12     57.3         126.    NA  21.7 TRUE           1.46e12
## 6 4319703577 2016-04-17     72.4         160.    25  27.5 TRUE           1.46e12
sum(duplicated(WeightLogInfo_new))
## [1] 0
class(WeightLogInfo_new$Date)
## [1] "Date"

4.3.2. Joining data sets

Now that we have merged and cleaned data sets, let's join them all together for the analysis.

## First, let's join the merged daily activity data with the sleep day table
DailyActivity_Sleep <- merged_data %>%  
  full_join(SleepDay_new, by = c("Id", "Activity_Date" = "SleepDay"))
head(DailyActivity_Sleep)
## # A tibble: 6 × 19
##           Id Activity_Date       Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
##        <dbl> <dttm>                <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 8877689391 2016-04-22 00:00:00    3427   18258   16.3    16.3     0      18258
## 2 8877689391 2016-05-09 00:00:00    3710   20226   18.2    18.2     0      20226
## 3 6962181067 2016-04-21 00:00:00    2179   11835    9.71    7.88    4.08   11835
## 4 6962181067 2016-04-25 00:00:00    2194   13239    9.27    9.08    2.79   13239
## 5 4020332650 2016-05-05 00:00:00    3429   11728    8.43    8.43    0      11728
## 6 1644430081 2016-04-30 00:00:00    3846   18213   13.2    13.2     0      18213
## # … with 11 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## #   SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>,
## #   TotalSleepRecords <dbl>, TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>,
## #   and abbreviated variable names ¹​Calories, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​StepTotal
## Now we can join the daily activity and sleep data with the weight log data
## Note: joining on Id only means each daily row is repeated for every weight entry a user logged
Joined_DailyActi_sleep_weight <- DailyActivity_Sleep %>%
  full_join(WeightLogInfo_new, by = "Id")
glimpse(Joined_DailyActi_sleep_weight)
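Because the weight table is joined on Id alone, every daily row for a user is repeated once for each weight entry that user logged. A quick sanity check on the join, a sketch using the objects created above:

# Row count grows after the Id-only join whenever a user has multiple weight entries
nrow(DailyActivity_Sleep)
nrow(Joined_DailyActi_sleep_weight)

# The number of distinct users should stay the same
n_distinct(Joined_DailyActi_sleep_weight$Id)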

5. Analyze

The following analysis is guided by the metadata of the FitBit data set described above.

5.1. Types of Users

We do not have demographic information such as the age of the sample users, so let's categorize the sample based on the total steps taken by each user.

According to 10000steps, a goal of 10,000 steps is the recommended daily step target for healthy adults to achieve health benefits.

Guidelines on steps and activity levels:

  • Sedentary is less than 5,000 steps per day
  • Low active is 5,000 to 7,499 steps per day
  • Somewhat active is 7,500 to 9,999 steps per day
  • Active is more than 10,000 steps per day
  • Highly active is more than 12,500 steps per day

5.2. Aggregating Data For Calculation

To categorize the sample, we first need to calculate the average calories, steps, sleep, and weight per user using the mean() function.

To do that, we can use the merged daily activity, sleep day, and weight log data.

## Average daily calories, steps, sleep and weight per user
## (mean() returns NA for any user with missing sleep or weight values)
Ave <- Joined_DailyActi_sleep_weight %>%
  group_by(Id) %>%
  summarize(mean_cal = mean(Calories), mean_steps = mean(TotalSteps), mean_sleep = mean(TotalMinutesAsleep), mean_weight = mean(WeightPounds))
head(Ave)
## # A tibble: 6 × 5
##           Id mean_cal mean_steps mean_sleep mean_weight
##        <dbl>    <dbl>      <dbl>      <dbl>       <dbl>
## 1 1503960366    1816.     12117.         NA        116.
## 2 1624580081    1483.      5744.         NA         NA 
## 3 1644430081    2811.      7283.         NA         NA 
## 4 1844505072    1573.      2580.         NA         NA 
## 5 1927972279    2173.       916.         NA        294.
## 6 2022484408    2510.     11371.         NA         NA

Now that we have aggregated data, we can categorize user types according to the above guideline.

## User Type👤 
user_type <- Ave %>% 
  mutate(user_type = case_when(
    mean_steps < 5000 ~ "Sedentary",
    mean_steps >= 5000 & mean_steps < 7500 ~ "Lightly Active",
    mean_steps >= 7500 & mean_steps < 10000 ~ "Fairly Active",
    mean_steps >= 10000 ~ "Very Active"))
head(user_type)
## # A tibble: 6 × 6
##           Id mean_cal mean_steps mean_sleep mean_weight user_type     
##        <dbl>    <dbl>      <dbl>      <dbl>       <dbl> <chr>         
## 1 1503960366    1816.     12117.         NA        116. Very Active   
## 2 1624580081    1483.      5744.         NA         NA  Lightly Active
## 3 1644430081    2811.      7283.         NA         NA  Lightly Active
## 4 1844505072    1573.      2580.         NA         NA  Sedentary     
## 5 1927972279    2173.       916.         NA        294. Sedentary     
## 6 2022484408    2510.     11371.         NA         NA  Very Active

→ Next, we can create a data frame with the percentage of each user type so that we can visualize the distribution among users.

Usertype_percentile <- user_type %>% 
  group_by(user_type) %>%
  summarise(total = n()) %>% 
  mutate(totals = sum(total)) %>% 
  group_by(user_type) %>% 
  summarise(labels = scales::percent(total/totals))
# factor() applies labels to the existing levels in alphabetical order
# ("Fairly Active", "Lightly Active", "Sedentary", "Very Active"),
# so the lowercase labels must be listed in that same order
Usertype_percentile$user_type <- factor(Usertype_percentile$user_type, labels = c("fairly active", "lightly active", "sedentary", "very active"))
head(Usertype_percentile)
## # A tibble: 4 × 2
##   user_type      labels
##   <fct>          <chr> 
## 1 fairly active  27%   
## 2 lightly active 27%   
## 3 sedentary      24%   
## 4 very active    21%

We can now visualize the user type percentages.

Usertype_percentile %>% 
  ggplot(aes(x="",y=labels, fill=user_type)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
  plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
  scale_fill_manual(values = c("green","yellow", "orange", "red")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  labs(title="User type percentiles")

5.3. Relationship between variables

5.3.1. Steps and Sleeps

→ User’s Sleeping behavior

We can check whether there are any trends in users' sleeping behavior based on the total steps taken by different types of users.

Hypothesis: users who take more steps tend to sleep the desired number of hours per day.

→ Let's test whether the above hypothesis is true.

## User Type_2👤 
user_type_2 <- Joined_DailyActi_sleep_weight %>% 
  mutate(user_type = case_when(
    TotalSteps < 5000 ~ "Sedentary",
    TotalSteps >= 5000 & TotalSteps < 7500 ~ "Lightly Active",
    TotalSteps >= 7500 & TotalSteps < 10000 ~ "Fairly Active",
    TotalSteps >= 10000 ~ "Very Active"))
head(user_type_2)
## # A tibble: 6 × 27
##           Id Activity_Date       Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
##        <dbl> <dttm>                <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 8877689391 2016-04-22 00:00:00    3427   18258    16.3    16.3       0   18258
## 2 8877689391 2016-04-22 00:00:00    3427   18258    16.3    16.3       0   18258
## 3 8877689391 2016-04-22 00:00:00    3427   18258    16.3    16.3       0   18258
## 4 8877689391 2016-04-22 00:00:00    3427   18258    16.3    16.3       0   18258
## 5 8877689391 2016-04-22 00:00:00    3427   18258    16.3    16.3       0   18258
## 6 8877689391 2016-04-22 00:00:00    3427   18258    16.3    16.3       0   18258
## # … with 19 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## #   SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>,
## #   TotalSleepRecords <dbl>, TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>,
## #   Date <date>, WeightKg <dbl>, WeightPounds <dbl>, Fat <dbl>, BMI <dbl>,
## #   IsManualReport <lgl>, LogId <dbl>, user_type <chr>, and abbreviated …
ggplot(data = user_type_2, aes(x = TotalSteps, y = TotalMinutesAsleep))+
  geom_point(aes(color = user_type, shape = user_type))+
  facet_wrap(~user_type)+
  labs(title = "User's sleeping behavior")+
  theme(plot.title = element_text(hjust = 0.5))

As we can see from the above graph, there is no clear relationship between users' sleeping behavior and the number of steps taken across the different user types.
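To back this visual impression with a number, here is a sketch of a correlation test on the day-level records that have both steps and sleep data (using the DailyActivity_Sleep data frame built earlier):

# Keep only records where both steps and sleep minutes are available
steps_sleep <- DailyActivity_Sleep %>% 
  filter(!is.na(TotalSteps), !is.na(TotalMinutesAsleep))

# Pearson correlation between daily steps and minutes asleep
cor.test(steps_sleep$TotalSteps, steps_sleep$TotalMinutesAsleep)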

5.3.2. Steps and Weight in pound

Before we get into this analysis, note that only 9 users have data in the WeightLogInfo table, so the results may not be accurate due to the limited number of samples.

ggplot(data = user_type)+
  geom_point(mapping = aes(x = mean_steps, y = mean_weight, color = user_type, shape = user_type))+
  facet_wrap(~user_type)+
  labs(title = "Steps and Weight")+
  theme(plot.title = element_text(hjust = 0.5))

With so few weight records it would be inaccurate to claim that users' total steps and weight are correlated, but as the plot above shows, the lowest average weight falls in the very active user category and the highest in the sedentary category.

5.3.3. Calories, steps and physical intensity

Now let's use a Tableau visualization to see if there is any relationship between calories burned, total steps, and physical intensity.

From the above visualization:

→ The scatter plot shows a positive correlation between calories and total steps.

→ The correlation between physical intensity and calories is not perfectly positive; the two variables do not move together in exactly the same proportion and direction.

→ Users did not usually log their activities in the app, as the LoggedActivitiesDistance column contains almost no data.
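The first and third observations can also be checked quickly in R on the merged day-level data, a sketch using the merged_data data frame imported earlier:

# Correlation between calories burned and total steps (expected to be positive)
cor(merged_data$Calories, merged_data$TotalSteps, use = "complete.obs")

# Share of daily records where any activity distance was actually logged in the app
mean(merged_data$LoggedActivitiesDistance > 0, na.rm = TRUE)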

5.4. Device Usage

5.4.1. Grouping users

Let's create groups of users based on the number of days our sample users used their devices throughout the one-month interval.

  • High: users who used their device for 21 to 31 days,
  • Medium: users who used their device for 11 to 20 days, and
  • Low: users who used their device for 0 to 10 days.

Now we can create a data frame called 'device_usage' based on the above categories:

device_usage <- DailyActivity_Sleep %>% 
  group_by(Id) %>%
  summarise(numofdays_used = n()) %>%   # one row per day, so n() counts the days of use
  mutate(dev_usage = case_when(
    numofdays_used >= 1 & numofdays_used <= 10 ~ 'Low',
    numofdays_used >= 11 & numofdays_used <= 20 ~ 'Medium',
    numofdays_used >= 21 & numofdays_used <= 31 ~ 'High')) %>% 
  ungroup()
head(device_usage)
## # A tibble: 6 × 3
##           Id numofdays_used dev_usage
##        <dbl>          <int> <chr>    
## 1 1503960366             31 High     
## 2 1624580081             31 High     
## 3 1644430081             30 High     
## 4 1844505072             31 High     
## 5 1927972279             31 High     
## 6 2022484408             31 High

→ Calculating device usage percentage

dev.usage_percent <- device_usage %>% 
  group_by(dev_usage) %>% 
  summarise(total = n()) %>% 
  mutate(percentage = scales::percent(total/sum(total)))
head(dev.usage_percent)
## # A tibble: 3 × 3
##   dev_usage total percentage
##   <chr>     <int> <chr>     
## 1 High         29 87.9%     
## 2 Low           1 3.0%      
## 3 Medium        3 9.1%
pie3D(dev.usage_percent[[2]], labels = dev.usage_percent[[3]], main = "Device Usage Percentile", col = rainbow(3),radius = 1) 
  legend("topright", dev.usage_percent[[1]], cex = 0.5, fill = rainbow(length(dev.usage_percent[[1]])))

As we can see from the above analysis:

87.9% of users used their devices frequently, for between 21 and 31 days,

3% of the sample users used their device for between 0 and 10 days, which is low usage, and

9.1% of users used their devices for between 11 and 20 days.

5.4.2. Daily smart device usage

Now we can calculate the total number of minutes users wore their device throughout the day using the daily activity and intensity data; to check for any trends, we also use the device_usage data frame.

First, let's merge the daily activity table with the device_usage data frame.

daily_usage <- merged_data %>% 
  full_join(device_usage, by = c("Id"))
head(daily_usage)
## # A tibble: 6 × 18
##           Id Activity_Date       Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
##        <dbl> <dttm>                <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 8877689391 2016-04-22 00:00:00    3427   18258   16.3    16.3     0      18258
## 2 8877689391 2016-05-09 00:00:00    3710   20226   18.2    18.2     0      20226
## 3 6962181067 2016-04-21 00:00:00    2179   11835    9.71    7.88    4.08   11835
## 4 6962181067 2016-04-25 00:00:00    2194   13239    9.27    9.08    2.79   13239
## 5 4020332650 2016-05-05 00:00:00    3429   11728    8.43    8.43    0      11728
## 6 1644430081 2016-04-30 00:00:00    3846   18213   13.2    13.2     0      18213
## # … with 10 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## #   SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>,
## #   numofdays_used <int>, dev_usage <chr>, and abbreviated variable names
## #   ¹​Calories, ²​TotalSteps, ³​TotalDistance, ⁴​TrackerDistance,
## #   ⁵​LoggedActivitiesDistance, ⁶​StepTotal

→ Now we can categorize users based on the total number of minutes they wear their smart device daily:

  • Full day

  • Half day and more

  • Less than half day

Let's calculate the total minutes of daily smart device usage and the corresponding percentage of the day so that each record can be assigned to one of the above categories. For that we can use all the active-minutes columns and the sedentary-minutes column from the daily activity table.

total_minu_usage <- daily_usage %>% 
  mutate(total_minute_wore = SedentaryMinutes + LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes) %>% 
  mutate(percent_minu_wore = 100*(total_minute_wore/1440)) %>% 
  mutate(categories = case_when(
    percent_minu_wore == 100 ~ "Full day",
    percent_minu_wore >= 50 & percent_minu_wore < 100 ~ "Half day & more",
    percent_minu_wore < 50 ~ "Less than half day"
  ))
head(total_minu_usage)
## # A tibble: 6 × 21
##           Id Activity_Date       Calor…¹ Total…² Total…³ Track…⁴ Logge…⁵ StepT…⁶
##        <dbl> <dttm>                <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 8877689391 2016-04-22 00:00:00    3427   18258   16.3    16.3     0      18258
## 2 8877689391 2016-05-09 00:00:00    3710   20226   18.2    18.2     0      20226
## 3 6962181067 2016-04-21 00:00:00    2179   11835    9.71    7.88    4.08   11835
## 4 6962181067 2016-04-25 00:00:00    2194   13239    9.27    9.08    2.79   13239
## 5 4020332650 2016-05-05 00:00:00    3429   11728    8.43    8.43    0      11728
## 6 1644430081 2016-04-30 00:00:00    3846   18213   13.2    13.2     0      18213
## # … with 13 more variables: SedentaryMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, VeryActiveMinutes <dbl>,
## #   SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>,
## #   numofdays_used <int>, dev_usage <chr>, total_minute_wore <dbl>,
## #   percent_minu_wore <dbl>, categories <chr>, and abbreviated variable names
## #   ¹​Calories, ²​TotalSteps, ³​TotalDistance, ⁴​TrackerDistance, …

Let's calculate the percentage for each of the above three categories.

percent_total_minu <- total_minu_usage %>% 
  group_by(categories) %>% 
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(categories) %>%
  summarise(total_minu_percent = total / totals) %>%
  mutate(labels = scales::percent(total_minu_percent))
head(percent_total_minu)
## # A tibble: 3 × 3
##   categories         total_minu_percent labels
##   <chr>                           <dbl> <chr> 
## 1 Full day                       0.509  50.9% 
## 2 Half day & more                0.465  46.5% 
## 3 Less than half day             0.0266 2.7%
pie3D(percent_total_minu[[2]], labels = percent_total_minu[[3]], main = "Daily Smart Device Usage", explode = 0.1, radius = 1, col= rainbow(length(percent_total_minu)))
legend("topright", percent_total_minu[[1]], cex = 0.5, fill = rainbow(length(percent_total_minu[[1]])))

  • 50.9% of users wore their device for the full day,

  • 46.5% of users wore their smart device for half the day or more, and

  • 2.7% of users wore their device for less than half the day, which is a very small number of users.

We can now check whether there is any relationship between the number of days the device was used over the month and the total minutes the device was worn each day.

For this we use:

→ Device usage:

  • High
  • Medium
  • Low, and

→ User categories based on the number of minutes they wore their smart device:

  • Full Day
  • Half day and more
  • Less than half day
ggplot(data = total_minu_usage)+
  geom_bar(mapping = aes(x = categories, fill = categories))+
  facet_wrap(~dev_usage)+
  labs(title = "Device Usage and Total Minutes Worn")+
  theme(axis.text.x = element_text(angle = 45))

As we can see from the above graph:

  • The majority of users who wore their device for the full day fall into the High device-usage category,

  • Similarly, almost all of the half-day users also fall into the High device-usage category, and

  • Medium and Low device users wore their smart device for less of the day than High device users.

As a reminder:

High users are those who used their device for between 21 and 31 days,

Medium users are those who used their device for between 11 and 20 days, and

Low users are those who used their device for 10 days or fewer in the month.

6. Act

6.1. Final Conclusion

As a reminder, the objective of this analysis is to identify trends in how consumers use non-Bellabeat smart devices by analyzing smart device usage data, and ultimately to draw high-level recommendations that can inform Bellabeat's marketing strategy.

To reach more reliable and useful high-level recommendations, analyzing Bellabeat's own data is important. As the sample size, age distribution, and reliability of the FitBit data are in question, further analysis is required using additional information and resources.

Based on my analysis, I concluded the following:

→ Only 21% of users fall into the "Very Active" category, taking the recommended number of daily steps.

Users rarely logged their activities in the app while they were active.

Among the different variables in the data sets, only total steps taken and calories show a clear positive correlation.

Most users do not fully utilize all of the app's features that could benefit them, for example the weight log feature.

87.9% of users used their device frequently (21 to 31 days) during the one-month interval, and 50.9% of users wore their smart device the whole day.

6.2. Recommendations

Based on the above conclusions, the following recommendations are drawn to help plan Bellabeat's future marketing strategies:

1. Engaging Users

For users to be more interactive with the Bellabeat app, we need to include more features that benefit and attract users; to name a few:

→ A matching feature that helps users find walking partners, which can also encourage users to reach their recommended daily steps, and

→ Different types of clubs within the app that users can join with other Bellabeat users.

The Bellabeat app is the first women's health app with private-key encryption, which gives users confidence that their personal data stays private.

In addition, since Bellabeat's focus is giving women the tools to live in harmony with themselves, it is important to continually promote all of the useful features in the app and across all products to help women reach their wellness goals.

2. Benefit And Reward

To encourage customers to achieve their personal goals, we can introduce a reward system: when users reach a specified goal within a certain time period, they collect points that can ultimately be redeemed as discounts, coupons, or Bellabeat product giveaways, based on the app's Wellness Score feature, which shows step, activity, sleep, meditation, and hydration progress throughout the day and night.

3. Notification and Awareness

To get users excited about receiving notifications and useful awareness tips without making them feel annoyed or overwhelmed, we need a smart way to send more personalized notifications, tips, and encouragement on a daily basis.
