SUMMARY

Bellabeat is a high-tech company that manufactures health-focused smart products. It offers smart devices that collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.

MISSION STATEMENT

To analyze smart device fitness data to gain insight into how consumers are using their smart devices, and to use these insights to guide Bellabeat’s marketing strategy for growth in the global smart device market.

PHASE1:ASK

BUSINESS TASK

The primary business goal is to utilize external data on smart water bottle usage to refine Bellabeat’s product development and marketing strategies within the smart water bottle niche. This effort is focused on gaining in-depth insights into consumer behaviors and preferences specific to smart water bottles. By achieving this objective, Bellabeat aims to optimize its approach, cater effectively to potential smart water bottle customers, and strategically position its products for success in the ever-evolving smart water bottle market.

PRODUCT

Spring - water bottle https://bellabeat.com/product/spring/

APPLICATION TO BELLABEAT CUSTOMERS

Spring’s app and smart technology calculate the optimal amount of water for the user’s body, based on age, height, weight, local weather, activity level, and pregnancy or breastfeeding status, and remind the user to drink. Because Spring helps users avoid dehydration and establish and maintain healthy hydration habits, it is a promising product to develop further in this market.

KEY STAKEHOLDERS

The main stakeholders are Urška Sršen, Bellabeat’s co-founder and Chief Creative Officer; Sando Mur, mathematician and Bellabeat co-founder; and the rest of the Bellabeat marketing analytics team.

PHASE2:PREPARE

DATA USED

The data source for this case study is the FitBit Fitness Tracker Data. The dataset is hosted on Kaggle and was made available through Mobius.

ACCESSIBILITY AND PRIVACY OF DATA

Checking the metadata of our dataset, we can confirm it is open-source. The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

INFORMATION ON DATASET

These datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016 (March 12 to May 12, 2016). Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

ROCCC ANALYSIS

  • Reliable : LOW – the dataset was collected from only 30 individuals whose gender is unknown.
  • Original : LOW – third-party data collected using Amazon Mechanical Turk.
  • Comprehensive : MEDIUM – the dataset contains multiple fields on daily activity intensity, calories used, daily steps taken, daily sleep time and weight records.
  • Current : MEDIUM – the data is 5 years old, but daily habits are unlikely to change much over a few years.
  • Cited : HIGH – the data collector and source are well documented.

DATA ORGANISATION

Available to us are 6 CSV documents and an Excel sheet compiled from research on other smart water bottle products. Each CSV document represents different quantitative data tracked by Fitbit. The data is in long format: each row is one time point per subject, so each subject has data in multiple rows. Every user has a unique ID and multiple rows, since data is tracked by day and time. We counted the sample size (users) of each table and verified the length of the analysis period: 31 days.

data <- data.frame(
  Filename = c("dailyActivity_merged.csv", "sleepDay_merged.csv", "dailySteps_merged.csv", "dailyIntensities_merged.csv", "dailyCalories_merged.csv", "weightLogInfo_merged.csv", "Smart_waterbottles.xlsx"),
  TypeOfFile = c("CSV", "CSV", "CSV", "CSV", "CSV", "CSV", "XLSX"),
  Description = c(
    "Daily Activity over 31 days of 33 users. Tracking daily: Steps, Distance, Intensities, Calories",
    "Daily sleep logs, tracked by: Total count of sleeps a day, Total minutes, Total Time in Bed",
    "Daily Steps over 31 days of 33 users",
    "Daily Intensity over 31 days of 33 users. Measured in Minutes and Distance, dividing groups in 4 categories: Sedentary, Lightly Active, Fairly Active, Very Active",
    "Daily Calories over 31 days of 33 users",
    "Weight tracked by day in Kg and Pounds over 30 days. Calculation of BMI. 5 users report weight manually, 3 users do not. In total there are 8 users",
    "Data on 6 other smart water bottle products"
  )
)
print(data)
##                      Filename TypeOfFile
## 1    dailyActivity_merged.csv        CSV
## 2         sleepDay_merged.csv        CSV
## 3       dailySteps_merged.csv        CSV
## 4 dailyIntensities_merged.csv        CSV
## 5    dailyCalories_merged.csv        CSV
## 6    weightLogInfo_merged.csv        CSV
## 7     Smart_waterbottles.xlsx       XLSX
##                                                                                                                                                          Description
## 1                                                                    Daily Activity over 31 days of 33 users. Tracking daily: Steps, Distance, Intensities, Calories
## 2                                                                        Daily sleep logs, tracked by: Total count of sleeps a day, Total minutes, Total Time in Bed
## 3                                                                                                                               Daily Steps over 31 days of 33 users
## 4 Daily Intensity over 31 days of 33 users. Measured in Minutes and Distance, dividing groups in 4 categories: Sedentary, Lightly Active, Fairly Active, Very Active
## 5                                                                                                                            Daily Calories over 31 days of 33 users
## 6                Weight tracked by day in Kg and Pounds over 30 days. Calculation of BMI. 5 users report weight manually, 3 users do not. In total there are 8 users
## 7                                                                                                                        Data on 6 other smart water bottle products

DATA CREDIBILITY AND INTEGRITY

Due to the limited sample size (30 users) and the absence of demographic information, we could encounter sampling bias. We cannot be sure that the sample is representative of the population as a whole. Further problems are that the dataset is not current and that the survey covered a short period (2 months). That is why we will give our case study an operational approach.

PHASE3:PROCESS

The entire analysis is done in RStudio.

INSTALLING PACKAGES AND LIBRARIES

We will choose the packages that will help us with our analysis and load them. We will use the following packages:

  • tidyverse
  • here
  • skimr
  • janitor
  • lubridate
  • ggpubr
  • ggrepel
  • hms
  • readxl
library(ggpubr)
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
## here() starts at C:/Users/Harish/OneDrive/Documents/Rcase_studies/bellabeat_casestudy
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate)
library(ggrepel)
library(readxl)

IMPORTING DATASETS

Knowing the datasets we have, we will import the ones that will help us answer our business task. In our analysis we will focus on the following datasets:

  • Daily_activity
  • Daily_steps
  • Hourly_steps
  • Daily_intensities
  • Daily_calories
  • Smart_waterbottles (product research)

Due to the small sample, we won’t consider Weight (8 users) for this analysis.

daily_steps <- read.csv("dailySteps_merged.csv")
daily_intensities <- read.csv("dailyIntensities_merged.csv")
daily_calories <- read.csv("dailyCalories_merged.csv")
hourly_steps <- read.csv("hourlySteps_merged.csv")
daily_activity <- read.csv("dailyActivity_merged.csv")
products <- read_excel("smart_waterbottles.xlsx")
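
A possible refinement, sketched under assumptions: read.csv() guesses that Id is numeric, which is why str() prints it as 1.5e+09 in the preview. Passing colClasses keeps Id as text, so it behaves as a label rather than a quantity. The temporary file and the name demo_steps are illustrative stand-ins for a real export such as dailySteps_merged.csv; the rest of the analysis works with either type.

```r
# Write a two-row stand-in for a Fitbit export to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Id,ActivityDay,StepTotal",
             "1503960366,4/12/2016,13162",
             "1503960366,4/13/2016,10735"), tmp)

# Read Id as character so it is treated as a label, not a number.
# Columns not named in colClasses are still type-guessed as usual.
demo_steps <- read.csv(tmp, colClasses = c(Id = "character"))
str(demo_steps)
```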

PREVIEW DATASET

We will preview our selected data frames and check the summary of each column.

head(daily_steps)
##           Id ActivityDay StepTotal
## 1 1503960366   4/12/2016     13162
## 2 1503960366   4/13/2016     10735
## 3 1503960366   4/14/2016     10460
## 4 1503960366   4/15/2016      9762
## 5 1503960366   4/16/2016     12669
## 6 1503960366   4/17/2016      9705
str(daily_steps)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
head(daily_intensities)
##           Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366   4/12/2016              728                  328
## 2 1503960366   4/13/2016              776                  217
## 3 1503960366   4/14/2016             1218                  181
## 4 1503960366   4/15/2016              726                  209
## 5 1503960366   4/16/2016              773                  221
## 6 1503960366   4/17/2016              539                  164
##   FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1                  13                25                       0
## 2                  19                21                       0
## 3                  11                30                       0
## 4                  34                29                       0
## 5                  10                36                       0
## 6                  20                38                       0
##   LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1                6.06                     0.55               1.88
## 2                4.71                     0.69               1.57
## 3                3.91                     0.40               2.44
## 4                2.83                     1.26               2.14
## 5                5.04                     0.41               2.71
## 6                2.51                     0.78               3.19
str(daily_intensities)
## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay             : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
head(daily_calories)
##           Id ActivityDay Calories
## 1 1503960366   4/12/2016     1985
## 2 1503960366   4/13/2016     1797
## 3 1503960366   4/14/2016     1776
## 4 1503960366   4/15/2016     1745
## 5 1503960366   4/16/2016     1863
## 6 1503960366   4/17/2016     1728
str(daily_calories)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
head(hourly_steps)
##           Id          ActivityHour StepTotal
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366  4/12/2016 1:00:00 AM       160
## 3 1503960366  4/12/2016 2:00:00 AM       151
## 4 1503960366  4/12/2016 3:00:00 AM         0
## 5 1503960366  4/12/2016 4:00:00 AM         0
## 6 1503960366  4/12/2016 5:00:00 AM         0
str(hourly_steps)
## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...
head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
str(daily_activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

CLEANING AND FORMATTING

Now that we know more about our data structures, we will process them to look for errors and inconsistencies.

VERIFYING NUMBER OF USERS

Before we continue with the cleaning, we want to check how many unique users each data frame contains.

n_unique(daily_steps$Id)
## [1] 33
n_unique(daily_intensities$Id)
## [1] 33
n_unique(daily_calories$Id)
## [1] 33
n_unique(hourly_steps$Id)
## [1] 33
n_unique(daily_activity$Id)
## [1] 33

DUPLICATES

We will now look for any duplicates.

sum(duplicated(daily_steps))
## [1] 0
sum(duplicated(daily_calories))
## [1] 0
sum(duplicated(daily_intensities))
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0
sum(duplicated(daily_activity))
## [1] 0
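
duplicated() on a whole data frame only flags rows that match in every column. As a complementary check, and assuming that one user should have at most one record per day, the same function can be applied to the key columns alone. The data frame demo below is made up for illustration.

```r
# Illustrative data: the third row repeats the key of the first row
# but carries a different step count, so a full-row check misses it.
demo <- data.frame(
  Id          = c(1, 1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016"),
  StepTotal   = c(13162, 10735, 13000, 9762)
)

full_row_dupes <- sum(duplicated(demo))
key_dupes      <- sum(duplicated(demo[, c("Id", "ActivityDay")]))

full_row_dupes  # identical rows
key_dupes       # repeated (Id, ActivityDay) pairs
```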

CLEAN AND RENAME COLUMNS

We want to ensure that column names use the right syntax and the same format in all datasets, since we will merge them later on. We are changing all column names to lower case.

# Lower-case every column name so the join keys match across data frames
# (for example, ActivityDay becomes activityday).
daily_steps <- rename_with(daily_steps, tolower)
daily_calories <- rename_with(daily_calories, tolower)
daily_intensities <- rename_with(daily_intensities, tolower)
hourly_steps <- rename_with(hourly_steps, tolower)
daily_activity <- rename_with(daily_activity, tolower)
products <- rename_with(products, tolower)

CONSISTENCY OF COLUMNS

Make sure the column names are consistent across the files used and check date format.

hourly_steps <- hourly_steps %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p"))
head(hourly_steps)
##           id           date_time steptotal
## 1 1503960366 2016-04-12 00:00:00       373
## 2 1503960366 2016-04-12 01:00:00       160
## 3 1503960366 2016-04-12 02:00:00       151
## 4 1503960366 2016-04-12 03:00:00         0
## 5 1503960366 2016-04-12 04:00:00         0
## 6 1503960366 2016-04-12 05:00:00         0
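
The format string used above can be checked on a couple of values before it is applied to a whole column. In this sketch the tz = "UTC" argument is an assumption added so that the result does not depend on the machine's time zone, and the AM/PM marker %p is assumed to be read in an English locale.

```r
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# %I is the hour on a 12-hour clock and %p the AM/PM marker.
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

format(parsed, "%Y-%m-%d %H:%M:%S")
sum(is.na(parsed))  # number of values that failed to parse
```

A non-zero count on the real column would point to rows whose text does not match the format.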

MERGING DATASETS

We will merge daily_intensities, daily_steps and daily_calories pairwise to look at the relationships between variables, using id and activityday together as the join key.

user_dailyx <- merge(daily_intensities, daily_calories, by = c("id", "activityday"))
user_dailyx$activityday <- as.Date(user_dailyx$activityday, format = "%m/%d/%Y")
glimpse(user_dailyx)
## Rows: 940
## Columns: 11
## $ id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ activityday              <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ sedentaryminutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ lightlyactiveminutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ fairlyactiveminutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ veryactiveminutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ sedentaryactivedistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ lightactivedistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ moderatelyactivedistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ veryactivedistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
user_dailyy <- merge(daily_steps, daily_calories, by = c("id", "activityday"))

user_dailyy$activityday <- as.Date(user_dailyy$activityday, format = "%m/%d/%Y")
glimpse(user_dailyy)
## Rows: 940
## Columns: 4
## $ id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ activityday <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-1…
## $ steptotal   <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054…
## $ calories    <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775…
user_dailyz <- merge(daily_intensities, daily_steps, by = c("id", "activityday"))
user_dailyz$activityday <- as.Date(user_dailyz$activityday, format = "%m/%d/%Y")
glimpse(user_dailyz)
## Rows: 940
## Columns: 11
## $ id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ activityday              <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ sedentaryminutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ lightlyactiveminutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ fairlyactiveminutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ veryactiveminutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ sedentaryactivedistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ lightactivedistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ moderatelyactivedistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ veryactivedistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ steptotal                <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
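
merge() performs an inner join by default, so rows without a partner in the other data frame disappear without a message. All three merges above return 940 rows, the same as their inputs, so nothing was lost here. The sketch below, on made-up frames named steps and calories, shows one way such a loss could be detected if it did occur.

```r
# Illustrative frames; user 2 has no calorie record for 4/13/2016.
steps <- data.frame(id = c(1, 1, 2, 2),
                    activityday = c("4/12/2016", "4/13/2016",
                                    "4/12/2016", "4/13/2016"),
                    steptotal = c(13162, 10735, 9762, 9705))
calories <- data.frame(id = c(1, 1, 2),
                       activityday = c("4/12/2016", "4/13/2016", "4/12/2016"),
                       calories = c(1985, 1797, 1745))

inner <- merge(steps, calories, by = c("id", "activityday"))
outer <- merge(steps, calories, by = c("id", "activityday"), all = TRUE)

nrow(inner)                 # rows with a match in both frames
sum(is.na(outer$calories))  # rows an inner merge would drop
```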

PHASE4-5:ANALYZE-SHARE

TYPE OF USERS

We can classify users by activity level based on their daily step count, as follows:

  • Sedentary - Less than 5000 steps a day.
  • Lightly active - Between 5000 and 7499 steps a day.
  • Fairly active - Between 7500 and 9999 steps a day.
  • Very active - 10000 steps a day or more.

Classification has been made per the following article: https://www.10000steps.org.au/articles/counting-steps/

First we will calculate the average daily steps per user.

daily_average <- user_dailyy %>%
  group_by(id) %>%
  summarise (mean_daily_steps = mean(steptotal), mean_daily_calories = mean(calories))

head(daily_average)
## # A tibble: 6 × 3
##           id mean_daily_steps mean_daily_calories
##        <dbl>            <dbl>               <dbl>
## 1 1503960366           12117.               1816.
## 2 1624580081            5744.               1483.
## 3 1644430081            7283.               2811.
## 4 1844505072            2580.               1573.
## 5 1927972279             916.               2173.
## 6 2022484408           11371.               2510.

We will now classify users based on their average daily steps.

user_type <- daily_average %>%
  mutate(user_type = case_when(
    mean_daily_steps < 5000 ~ "sedentary",
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active",
    mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active",
    mean_daily_steps >= 10000 ~ "very active"
  ))

head(user_type)
## # A tibble: 6 × 4
##           id mean_daily_steps mean_daily_calories user_type     
##        <dbl>            <dbl>               <dbl> <chr>         
## 1 1503960366           12117.               1816. very active   
## 2 1624580081            5744.               1483. lightly active
## 3 1644430081            7283.               2811. lightly active
## 4 1844505072            2580.               1573. sedentary     
## 5 1927972279             916.               2173. sedentary     
## 6 2022484408           11371.               2510. very active
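
The same four bands can also be written with base R's cut(), which may be easier to audit because the breakpoints appear once. This is a sketch rather than the method used in this analysis; classify_steps is a hypothetical helper name. With right = FALSE every interval includes its lower bound and excludes its upper bound, so a fractional average such as 7499.5 still lands in a band.

```r
classify_steps <- function(mean_daily_steps) {
  cut(mean_daily_steps,
      breaks = c(-Inf, 5000, 7500, 10000, Inf),
      labels = c("sedentary", "lightly active",
                 "fairly active", "very active"),
      right  = FALSE)  # intervals are [lower, upper)
}

# Values chosen to sit on or next to each boundary.
as.character(classify_steps(c(4999.9, 5000, 7499.5, 9999.9, 10000)))
```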

Now that we have a new column with the user type we will create a data frame with the percentage of each user type to better visualize them on a graph.

user_type_percent <- user_type %>%
  group_by(user_type) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(user_type) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

user_type_percent$user_type <- factor(user_type_percent$user_type , levels = c("very active", "fairly active", "lightly active", "sedentary"))


head(user_type_percent)
## # A tibble: 4 × 3
##   user_type      total_percent labels
##   <fct>                  <dbl> <chr> 
## 1 fairly active          0.273 27.3% 
## 2 lightly active         0.273 27.3% 
## 3 sedentary              0.242 24.2% 
## 4 very active            0.212 21.2%
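
For a quick cross-check of shares like these, base R's table() and prop.table() give the same kind of result in two calls. The labels in demo_types are invented for illustration and are not the study's users.

```r
# Illustrative labels only.
demo_types <- c("sedentary", "very active", "sedentary", "fairly active")

shares <- prop.table(table(demo_types))  # counts divided by their total
shares

# Formatted as percentage labels, in the same order as `shares`.
sprintf("%.1f%%", 100 * as.numeric(shares))
```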

Below we can see that users are fairly evenly distributed across activity levels based on their daily step count. This suggests that users of all activity levels use smart devices.

user_type_percent %>%
  ggplot(aes(x="",y=total_percent, fill=user_type)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#85e085","#e6e600", "#ffd480", "#ff8080")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  labs(title="User type distribution")

STEPS PER DAYS OF WEEK

We now want to know on which days of the week users are most active. We will also verify whether users walk the recommended number of steps. Below we derive the weekday from our date column and calculate the average steps walked per day of the week.

weekday_steps <- user_dailyy %>%
  mutate(weekday = weekdays(activityday))

weekday_steps$weekday <- ordered(weekday_steps$weekday,
                                 levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                            "Friday", "Saturday", "Sunday"))

# Average steps per weekday across all users and dates.
weekday_steps <- weekday_steps %>%
  group_by(weekday) %>%
  summarize(mean_steps = mean(steptotal))

head(weekday_steps)
## # A tibble: 6 × 2
##   weekday   mean_steps
##   <ord>          <dbl>
## 1 Monday         7781.
## 2 Tuesday        8125.
## 3 Wednesday      7559.
## 4 Thursday       7406.
## 5 Friday         7448.
## 6 Saturday       8153.
ggarrange(
    ggplot(weekday_steps) +
      geom_col(aes(weekday, mean_steps), fill = "#006699") +
      geom_hline(yintercept = 7500) +
      labs(title = "Daily steps per weekday", x= "", y = "") +
      theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))
)

The graph above shows that average steps stay close to the recommended 7500 steps a day: Monday, Tuesday, Wednesday and Saturday exceed it, Thursday (7406) and Friday (7448) fall just short, and Sunday is the least active day.
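
The comparison against the 7500-step line can also be made in a table rather than read off the plot. The sketch below uses the six dates and step counts shown in the daily_steps preview (a single user, so it only illustrates the mechanics). It takes the ISO weekday number from format(), which does not depend on the session language the way weekdays() names do; the column names weekday_num and meets_7500 are illustrative.

```r
# Dates and steps of the first six rows from the daily_steps preview.
demo <- data.frame(
  activityday = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                          "2016-04-15", "2016-04-16", "2016-04-17")),
  steptotal   = c(13162, 10735, 10460, 9762, 12669, 9705)
)

# %u is the ISO weekday number: 1 = Monday ... 7 = Sunday.
demo$weekday_num <- as.integer(format(demo$activityday, "%u"))

by_day <- aggregate(steptotal ~ weekday_num, data = demo, FUN = mean)
by_day$meets_7500 <- by_day$steptotal >= 7500
by_day
```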

USER ACTIVITY

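Going deeper into our analysis, we want to know when exactly users are most active during the day.

We will use the hourly_steps data frame and split the date_time column into a date and a time. Midnight timestamps carry no time component once converted to text, so separate() warns about them and we fill the missing times with 00:00:00.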

hourly_steps <- hourly_steps %>%
  separate(date_time, into = c("activityday", "time"), sep = " ")%>%
  mutate(activityday = parse_date(activityday))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 934 rows [1, 25, 49, 73,
## 97, 121, 145, 169, 193, 217, 241, 265, 289, 313, 337, 361, 385, 409, 433, 457,
## ...].
hourly_steps$time[is.na(hourly_steps$time)] <- "00:00:00"
head(hourly_steps)
##           id activityday     time steptotal
## 1 1503960366  2016-04-12 00:00:00       373
## 2 1503960366  2016-04-12 01:00:00       160
## 3 1503960366  2016-04-12 02:00:00       151
## 4 1503960366  2016-04-12 03:00:00         0
## 5 1503960366  2016-04-12 04:00:00         0
## 6 1503960366  2016-04-12 05:00:00         0
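
An alternative that avoids the missing pieces altogether is to build both columns with format() and an explicit pattern, which always writes the time even at midnight. This is a sketch on two made-up timestamps; tz = "UTC" is an assumption made here for reproducibility.

```r
stamps <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 13:00:00"),
                     tz = "UTC")

demo <- data.frame(
  activityday = as.Date(format(stamps, "%Y-%m-%d")),
  time        = format(stamps, "%H:%M:%S"),  # midnight stays "00:00:00"
  stringsAsFactors = FALSE
)
demo
```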
hourly_steps %>%
  group_by(time) %>%
  summarize(average_steps = mean(steptotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = average_steps, fill = average_steps)) + 
  labs(title = "Hourly steps throughout the day", x="", y="") + 
  scale_fill_gradient(low = "green", high = "red")+
  theme(axis.text.x = element_text(angle = 90))

We can see that users are more active between 8am and 7pm, walking the most steps at lunch time from 12pm to 2pm and in the evening from 5pm to 7pm.

CORRELATIONS

We will now determine whether there is any correlation between daily steps and calories.

ggarrange(
ggplot(user_dailyy, aes(x=steptotal, y=calories))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Daily steps vs Calories", x = "Daily steps", y= "Calories") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can see a positive correlation between steps and calories burned. As expected, the more steps users walk, the more calories they tend to burn.
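
The visual impression can be backed by a number using cor(). The sketch below uses only the six step and calorie values shown in the previews, all from one user, so it demonstrates the call rather than the result for the full sample. On the merged data the same call would take user_dailyy$steptotal and user_dailyy$calories.

```r
# First six rows from the daily_steps and daily_calories previews.
steps    <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728)

pearson  <- cor(steps, calories)                       # linear association
spearman <- cor(steps, calories, method = "spearman")  # rank association

round(c(pearson = pearson, spearman = spearman), 2)
```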

USE OF SMART DEVICE

Days used smart device

We will calculate how often users use their smart device, classifying our sample into three categories, knowing that the date interval is 31 days:

  • high use - users who use their device between 21 and 31 days.
  • moderate use - users who use their device between 11 and 20 days.
  • low use - users who use their device between 1 and 10 days.

First we will create a new data frame grouping by Id, calculating number of days used and creating a new column with the classification explained above.

daily_use <- user_dailyy %>%
  group_by(id) %>%
  summarize(days_used = n()) %>%   # one row per user per recorded day
  mutate(usage = case_when(
    days_used >= 1 & days_used <= 10 ~ "low use",
    days_used >= 11 & days_used <= 20 ~ "moderate use",
    days_used >= 21 & days_used <= 31 ~ "high use"
  ))
  
head(daily_use)
## # A tibble: 6 × 3
##           id days_used usage   
##        <dbl>     <int> <chr>   
## 1 1503960366        31 high use
## 2 1624580081        31 high use
## 3 1644430081        30 high use
## 4 1844505072        31 high use
## 5 1927972279        31 high use
## 6 2022484408        31 high use
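
Counting rows treats every recorded day as a day of use. A stricter reading, offered here only as an assumption and not something the dataset states, is that a day with zero steps may be a day the device was not worn. The sketch below compares both counts on a made-up frame; days_recorded and days_active are illustrative names.

```r
# Illustrative data: user 1 has three records, one of them with zero steps.
demo <- data.frame(
  id          = c(1, 1, 1, 2),
  activityday = as.Date(c("2016-04-12", "2016-04-13",
                          "2016-04-14", "2016-04-12")),
  steptotal   = c(13162, 10735, 0, 9762)
)

# Distinct dates with a record, per user.
days_recorded <- tapply(demo$activityday, demo$id,
                        function(d) length(unique(d)))

# Days with at least one step, per user.
days_active <- tapply(demo$steptotal > 0, demo$id, sum)

days_recorded
days_active
```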

We will now create a percentage data frame to better visualize the results in the graph. We are also ordering our usage levels.

daily_use_percent <- daily_use %>%
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(usage) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

daily_use_percent$usage <- factor(daily_use_percent$usage, levels = c("high use", "moderate use", "low use"))

head(daily_use_percent)
## # A tibble: 3 × 3
##   usage        total_percent labels
##   <fct>                <dbl> <chr> 
## 1 high use            0.879  87.9% 
## 2 low use             0.0303 3.0%  
## 3 moderate use        0.0909 9.1%
daily_use_percent %>%
  ggplot(aes(x="",y=total_percent, fill=usage)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("#006633","#00e673","#80ffbf"),
                    labels = c("High use - 21 to 31 days",
                                 "Moderate use - 11 to 20 days",
                                 "Low use - 1 to 10 days"))+
  labs(title="Daily use of smart device")

Analyzing our results, we can see that:

  • 88% of the users in our sample use their device frequently, between 21 and 31 days.
  • 9% use their device between 11 and 20 days.
  • 3% of our sample rarely use their device.

Time used smart device

To be more precise, we want to see how many minutes users wear their device per day. For that we will merge the daily_use data frame created above with daily_activity, so that we can also filter results by daily use of the device.

daily_use_merged <- merge(daily_activity, daily_use, by = "id")
head(daily_use_merged)
##           id activitydate totalsteps totaldistance trackerdistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   lightactivedistance sedentaryactivedistance veryactiveminutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories days_used
## 1                  13                  328              728     1985        31
## 2                  19                  217              776     1797        31
## 3                  11                  181             1218     1776        31
## 4                  34                  209              726     1745        31
## 5                  10                  221              773     1863        31
## 6                  20                  164              539     1728        31
##      usage
## 1 high use
## 2 high use
## 3 high use
## 4 high use
## 5 high use
## 6 high use

We need to create a new data frame that calculates the total number of minutes users wore the device each day and assigns each day to one of three categories:

  • All day - device was worn all day.
  • More than half day - device was worn more than half of the day.
  • Less than half day - device was worn less than half of the day.
# Total wear time is the sum of the four activity-minute columns;
# a full day is 24 * 60 = 1440 minutes.
minutes_worn <- daily_use_merged %>%
  mutate(total_minutes_worn = veryactiveminutes + fairlyactiveminutes +
           lightlyactiveminutes + sedentaryminutes) %>%
  mutate(percent_minutes_worn = (total_minutes_worn / 1440) * 100) %>%
  # Days with 0 recorded minutes match no condition below and get NA
  mutate(worn = case_when(
    percent_minutes_worn == 100 ~ "All day",
    percent_minutes_worn < 100 & percent_minutes_worn >= 50 ~ "More than half day",
    percent_minutes_worn < 50 & percent_minutes_worn > 0 ~ "Less than half day"
  ))

head(minutes_worn)
##           id activitydate totalsteps totaldistance trackerdistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   lightactivedistance sedentaryactivedistance veryactiveminutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories days_used
## 1                  13                  328              728     1985        31
## 2                  19                  217              776     1797        31
## 3                  11                  181             1218     1776        31
## 4                  34                  209              726     1745        31
## 5                  10                  221              773     1863        31
## 6                  20                  164              539     1728        31
##      usage total_minutes_worn percent_minutes_worn               worn
## 1 high use               1094             75.97222 More than half day
## 2 high use               1033             71.73611 More than half day
## 3 high use               1440            100.00000            All day
## 4 high use                998             69.30556 More than half day
## 5 high use               1040             72.22222 More than half day
## 6 high use                761             52.84722 More than half day
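
As a check on the arithmetic, the sketch below recomputes total_minutes_worn, percent_minutes_worn and worn in base R for rows 1, 3 and 6 of the output above. The helper name classify_worn is hypothetical. Its thresholds mirror the case_when() conditions used to build minutes_worn.

```r
minutes_per_day <- 24 * 60  # 1440

# Activity minutes copied from rows 1, 3 and 6 of the printed output.
sample_rows <- data.frame(
  veryactiveminutes    = c(25, 30, 38),
  fairlyactiveminutes  = c(13, 11, 20),
  lightlyactiveminutes = c(328, 181, 164),
  sedentaryminutes     = c(728, 1218, 539)
)

# Hypothetical helper: assign a wear-time category from the percentage of the day.
classify_worn <- function(percent) {
  ifelse(percent == 100, "All day",
    ifelse(percent >= 50 & percent < 100, "More than half day",
      ifelse(percent > 0 & percent < 50, "Less than half day", NA_character_)))
}

sample_rows$total_minutes_worn <- with(
  sample_rows,
  veryactiveminutes + fairlyactiveminutes + lightlyactiveminutes + sedentaryminutes
)
sample_rows$percent_minutes_worn <- sample_rows$total_minutes_worn / minutes_per_day * 100
sample_rows$worn <- classify_worn(sample_rows$percent_minutes_worn)

print(sample_rows[, c("total_minutes_worn", "percent_minutes_worn", "worn")])
```

The totals come out as 1094, 1440 and 761 minutes, matching the printed rows. Only a day with exactly 1440 recorded minutes counts as "All day".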

As before, we create new data frames to better visualize our results. In this case we create four data frames and later arrange their plots in a single visualization.

  • The first data frame covers all users and calculates the percentage of daily records that fall into each of the three wear-time categories.
minutes_worn_percent<- minutes_worn%>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))
head(minutes_worn_percent)
## # A tibble: 3 × 3
##   worn               total_percent labels
##   <chr>                      <dbl> <chr> 
## 1 All day                   0.509  50.9% 
## 2 Less than half day        0.0266 2.7%  
## 3 More than half day        0.465  46.5%
  • The other three data frames are filtered by usage category, so that we can also compare wear time across the daily-use groups.
minutes_worn_highuse <- minutes_worn%>%
  filter (usage == "high use")%>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_moduse <- minutes_worn%>%
  filter(usage == "moderate use") %>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_lowuse <- minutes_worn%>%
  filter (usage == "low use") %>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_highuse$worn <- factor(minutes_worn_highuse$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_percent$worn <- factor(minutes_worn_percent$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_moduse$worn <- factor(minutes_worn_moduse$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_lowuse$worn <- factor(minutes_worn_lowuse$worn, levels = c("All day", "More than half day", "Less than half day"))

head(minutes_worn_highuse)
## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                   0.498  49.8% 
## 2 Less than half day        0.0273 2.7%  
## 3 More than half day        0.474  47.4%
head(minutes_worn_moduse)
## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                   0.649  65%   
## 2 Less than half day        0.0175 2%    
## 3 More than half day        0.333  33%
head(minutes_worn_lowuse)
## # A tibble: 2 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                     0.75 75%   
## 2 More than half day          0.25 25%
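
The four summaries above repeat the same steps and differ only in the usage filter. The following sketch shows how that logic could be wrapped in one function. It uses base R rather than the dplyr pipeline of this report, the helper name worn_share is hypothetical, and the toy data frame is made up for illustration.

```r
worn_levels <- c("All day", "More than half day", "Less than half day")

# Hypothetical helper: share of daily records per wear-time category,
# optionally restricted to one usage group.
worn_share <- function(data, usage_level = NULL) {
  if (!is.null(usage_level)) {
    data <- data[data$usage == usage_level, , drop = FALSE]
  }
  # Fixing the factor levels keeps categories with zero records in the result.
  counts <- table(factor(data$worn, levels = worn_levels))
  share <- as.numeric(prop.table(counts))
  data.frame(
    worn = factor(worn_levels, levels = worn_levels),
    total_percent = share,
    labels = sprintf("%.1f%%", share * 100)
  )
}

# Made-up records: one row per user per day.
toy <- data.frame(
  usage = c("high use", "high use", "high use", "low use"),
  worn  = c("All day", "All day", "More than half day", "Less than half day")
)

print(worn_share(toy))              # all records
print(worn_share(toy, "high use"))  # one usage group only
```

Unlike the grouped summaries above, this version returns a row for every category, including one with a share of zero, and always formats labels with one decimal place.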

Now that we have created the four data frames and ordered the levels of worn, we can visualize the results. The plots are arranged together so that they can be compared at a glance.

ggarrange(
  ggplot(minutes_worn_percent, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5)) +
    scale_fill_manual(values = c("#004d99", "#3399ff", "#cce6ff"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3.5)+
  labs(title="Time worn per day", subtitle = "Total Users"),
  ggarrange(
  ggplot(minutes_worn_highuse, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = "none")+
    scale_fill_manual(values = c("#004d99", "#3399ff", "#cce6ff"))+
  geom_text_repel(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="", subtitle = "High use - Users"), 
  ggplot(minutes_worn_moduse, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none") +
    scale_fill_manual(values = c("#004d99", "#3399ff", "#cce6ff"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="", subtitle = "Moderate use - Users"), 
  ggplot(minutes_worn_lowuse, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none") +
    scale_fill_manual(values = c("#004d99", "#3399ff", "#cce6ff"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="", subtitle = "Low use - Users"), 
  ncol = 3),
  nrow = 2)
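
The four pie charts share the same layers and theme. The following sketch shows how they could be produced by one function, assuming the ggplot2 package is installed. The helper name worn_pie and the named colour vector are hypothetical. The sketch also uses geom_text() for every panel, whereas the code above uses geom_text_repel() for the high-use panel.

```r
library(ggplot2)

# Named colours, so each category keeps its colour even when a level is absent.
worn_colours <- c("All day" = "#004d99",
                  "More than half day" = "#3399ff",
                  "Less than half day" = "#cce6ff")

# Hypothetical helper: one pie chart for a summary data frame
# with the columns worn, total_percent and labels.
worn_pie <- function(data, subtitle, title = "", show_legend = FALSE,
                     label_size = 3) {
  ggplot(data, aes(x = "", y = total_percent, fill = worn)) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar("y", start = 0) +
    theme_minimal() +
    theme(axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          panel.border = element_blank(),
          panel.grid = element_blank(),
          axis.ticks = element_blank(),
          axis.text.x = element_blank(),
          plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
          plot.subtitle = element_text(hjust = 0.5),
          legend.position = if (show_legend) "right" else "none") +
    scale_fill_manual(values = worn_colours) +
    geom_text(aes(label = labels),
              position = position_stack(vjust = 0.5), size = label_size) +
    labs(title = title, subtitle = subtitle)
}

# Usage with the overall shares printed earlier in this section.
total_share <- data.frame(
  worn = factor(c("All day", "More than half day", "Less than half day"),
                levels = names(worn_colours)),
  total_percent = c(0.509, 0.465, 0.0266),
  labels = c("50.9%", "46.5%", "2.7%")
)
p_total <- worn_pie(total_share, subtitle = "Total Users",
                    title = "Time worn per day", show_legend = TRUE,
                    label_size = 3.5)
```

With a helper like this, the calls passed to ggarrange() would differ only in the data frame and the subtitle.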

Per our plots, users wore the device all day on about 51% of recorded days, more than half the day on about 46.5%, and less than half the day on just under 3%.

Just a reminder

  • high use - users who use their device between 21 and 31 days.
  • moderate use - users who use their device between 11 and 20 days.
  • low use - users who use their device between 1 and 10 days.

If we split the users by the number of days they used the device and then check how long they wore it on each of those days, we get the following results:

  • High use: users who used their device on 21 to 31 days wore it all day on about 50% of those days, and more than half the day but not all day on 47.4%.
  • Moderate and low use: these users wore the device all day on a larger share of the days they used it, about 65% and 75% respectively.

MARKET COMPETITION

We use an Excel sheet containing data gathered through research on other products similar to Bellabeat’s SPRING.

ggplot(products, aes(x = ratings, y = price, color = product)) +
  geom_point() +
  theme(plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  labs(title = "Market Competitors")

PHASE6:ACT

Bellabeat’s mission is deeply rooted in empowering women through data-driven insights. To effectively support Bellabeat’s mission and address our business objectives, it is imperative to harness our own comprehensive tracking data for in-depth analysis. The data sets we’ve employed thus far have limitations, primarily their small sample size and the absence of user demographic information. As our primary target demographic comprises young and adult women, it is crucial to persist in uncovering actionable trends within our data sets. This ongoing pursuit of insights will enable us to craft a focused and effective marketing strategy that resonates with our core audience, ensuring that we continue to serve and empower women in their health and wellness journeys.

That being said, our analysis revealed several trends that may help our online campaign and improve Bellabeat SPRING.

FEATURES TO BE IMPROVED

In our analysis we did not only look at trends in users’ daily habits. We also found that 88% of users used their device on 21 to 31 days and that the device was worn all day on about 50% of the days it was used. We can continue to promote the following features of Bellabeat’s products:

  • Water-resistant: Enhancing the water-resistant capabilities of the product will ensure it remains functional and reliable, even in wet or humid conditions.
  • Long-lasting batteries: Improving battery life is essential for providing users with a more reliable and longer-lasting experience, reducing the need for frequent recharging.
  • Fashion/elegant product: Elevating the design and aesthetics of the product will make it more appealing, combining style with functionality for a more attractive user experience.
  • Insulation: Enhancing insulation features can better protect the product’s internal components from temperature variations and environmental factors, ensuring consistent performance.
  • Bellabeat app auto-sync feature: Implementing an automatic syncing feature in the Bellabeat app will streamline data transfer and make it more convenient for users to access their health and wellness information.

RECOMMENDATIONS

  • Unique features: Incorporating these innovative features can enhance the functionality and desirability of smart water bottles, making them more attractive to consumers who want to maintain a healthy and eco-conscious lifestyle.
    • Emergency Features: Include a distress signal or emergency beacon for outdoor enthusiasts or those in need of assistance.
    • Leak Detection: Incorporate sensors to detect leaks and send alerts to prevent messes and water wastage.
    • Social Sharing: Enable users to share their hydration achievements and challenges with friends and social networks.
    • Eco-Friendly Materials: Use sustainable and recyclable materials in the bottle’s construction to reduce its environmental impact.
  • Challenges between users: To promote use of the logged activities feature, Bellabeat could advertise challenges between users. If the budget permits, there could be an incentive for winning a certain number of challenges.