Scenario

About the Company

Bellabeat was founded by Urska Srsen and Sando Mur in 2013. It is a high-tech company focused on wellness for women, using beautifully designed technology that informs and inspires.

Products

  • Bellabeat app: provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits.

  • Leaf: Bellabeat's wellness tracker, which can be worn as a bracelet, necklace, or clip. It connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: wellness smart watch that also connects to the Bellabeat app and tracks the user's activity, sleep, and stress.

  • Spring: water bottle that tracks daily water intake through the app.

  • Bellabeat membership: subscription-based membership that gives users 24/7 access to personalized guidance on nutrition, activity, sleep, health, beauty, and mindfulness based on their lifestyle and goals.

Ask

Stakeholders

  • Urska Srsen: cofounder and Chief Creative Officer of Bellabeat

  • Sando Mur: mathematician and Bellabeat cofounder

  • Bellabeat marketing analytics team

Business Task:

As a junior data analyst on the Bellabeat marketing team, I am tasked with analyzing smart device data to help unlock new growth opportunities. I will focus on one Bellabeat product and analyze Fitbit data to gain insight into how consumers are using their smart devices.

Questions for Analysis:

  1. What are some trends in smart device usage?

  2. How could these trends apply to Bellabeat customers?

  3. How could these trends help influence Bellabeat marketing strategy?

Prepare

Data set

FitBit Fitness Tracker Data by Mobius is a well-documented data set generated from a survey distributed via Amazon Mechanical Turk between March 12, 2016 and May 12, 2016. Thirty users consented to submit personal tracker data covering physical activity, sleep, and heart rate.

Credibility

The data dates from 2016 and covers only about 30 participants, so it is too old and too small to be representative enough for business decisions. It is used here to showcase my data analytics skills and to provide preliminary insight.

Importing Libraries

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggplot2)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ stringr   1.5.1
## ✔ forcats   1.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing Data

library(readr)
dailyActivity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dailyCalories <- read_csv("dailyCalories_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleepDay <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weightLogInfo <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Preview data

Take a look at what type of data is collected

head(dailyActivity)
## # A tibble: 6 × 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>
## 1 1503960366 4/12/2016         13162          8.5             8.5 
## 2 1503960366 4/13/2016         10735          6.97            6.97
## 3 1503960366 4/14/2016         10460          6.74            6.74
## 4 1503960366 4/15/2016          9762          6.28            6.28
## 5 1503960366 4/16/2016         12669          8.16            8.16
## 6 1503960366 4/17/2016          9705          6.48            6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
head(dailyCalories)
## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728
head(sleepDay)
## # A tibble: 6 × 5
##           Id SleepDay        TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                       <dbl>              <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:0…                 1                327            346
## 2 1503960366 4/13/2016 12:0…                 2                384            407
## 3 1503960366 4/15/2016 12:0…                 1                412            442
## 4 1503960366 4/16/2016 12:0…                 2                340            367
## 5 1503960366 4/17/2016 12:0…                 1                700            712
## 6 1503960366 4/19/2016 12:0…                 1                304            320
head(weightLogInfo)
## # A tibble: 6 × 8
##           Id Date       WeightKg WeightPounds   Fat   BMI IsManualReport   LogId
##        <dbl> <chr>         <dbl>        <dbl> <dbl> <dbl> <lgl>            <dbl>
## 1 1503960366 5/2/2016 …     52.6         116.    22  22.6 TRUE           1.46e12
## 2 1503960366 5/3/2016 …     52.6         116.    NA  22.6 TRUE           1.46e12
## 3 1927972279 4/13/2016…    134.          294.    NA  47.5 FALSE          1.46e12
## 4 2873212765 4/21/2016…     56.7         125.    NA  21.5 TRUE           1.46e12
## 5 2873212765 5/12/2016…     57.3         126.    NA  21.7 TRUE           1.46e12
## 6 4319703577 4/17/2016…     72.4         160.    25  27.5 TRUE           1.46e12

Identify all columns in the data

colnames(dailyActivity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(dailyCalories)
## [1] "Id"          "ActivityDay" "Calories"
colnames(sleepDay)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weightLogInfo)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

Process

Identify unique participants in each data frame

n_distinct(dailyActivity$Id)
## [1] 33
n_distinct(sleepDay$Id)
## [1] 24
n_distinct(dailyCalories$Id)
## [1] 33
n_distinct(weightLogInfo$Id)
## [1] 8

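Because the weight log covers only 8 participants, we will not use that data. Note that the activity and calorie tables each contain 33 distinct Ids, slightly more than the 30 users mentioned in the data set description.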
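
The participant counts above can also be gathered in one step, which makes it easier to see which Ids are missing from a given table. The following is a minimal sketch on toy data; the data frames `activity_toy`, `sleep_toy`, and `weight_toy` and their values are hypothetical stand-ins, not the real Fitbit tables.

```r
library(dplyr)

# Toy stand-ins for the real data frames (hypothetical values)
activity_toy <- tibble(Id = c(1, 1, 2, 3))
sleep_toy    <- tibble(Id = c(1, 2, 2))
weight_toy   <- tibble(Id = c(3))

frames <- list(activity = activity_toy, sleep = sleep_toy, weight = weight_toy)

# Number of distinct participants in each data frame
participant_counts <- sapply(frames, function(df) n_distinct(df$Id))
participant_counts

# Ids that appear in the activity table but never in the sleep table
missing_from_sleep <- setdiff(activity_toy$Id, sleep_toy$Id)
missing_from_sleep
```

Applied to the real tables, the same `setdiff()` call would list the participants who logged activity but no sleep.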

Check for duplicates

sum(duplicated(dailyActivity))
## [1] 0
sum(duplicated(sleepDay))
## [1] 3
sum(duplicated(dailyCalories))
## [1] 0

Let's remove the duplicates from sleepDay

# Drop exact duplicate rows and any rows with missing values
sleepDay <- sleepDay %>% 
  distinct() %>% 
  drop_na()
# Confirm that no duplicates remain
sum(duplicated(sleepDay))
## [1] 0
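
To make the effect of this step concrete, here is a small sketch on a toy table. The data frame `sleep_toy` and its values are hypothetical; only `duplicated()` and `distinct()` are the same functions used above.

```r
library(dplyr)

# Toy sleep table with one exact duplicate row (hypothetical values)
sleep_toy <- tibble(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 384, 400)
)

# Count rows that repeat an earlier row exactly
dup_count <- sum(duplicated(sleep_toy))

# Keep only the first occurrence of each row
sleep_clean <- distinct(sleep_toy)

dup_count
nrow(sleep_clean)
```

Only rows that match in every column are removed, so two different nights for the same participant are both kept.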

Cleaning data

The calls below only preview the cleaned column names. Their results are not assigned, so the original column names are kept for the rest of the analysis.

clean_names(dailyActivity)
## # A tibble: 940 × 15
##            id activity_date total_steps total_distance tracker_distance
##         <dbl> <chr>               <dbl>          <dbl>            <dbl>
##  1 1503960366 4/12/2016           13162           8.5              8.5 
##  2 1503960366 4/13/2016           10735           6.97             6.97
##  3 1503960366 4/14/2016           10460           6.74             6.74
##  4 1503960366 4/15/2016            9762           6.28             6.28
##  5 1503960366 4/16/2016           12669           8.16             8.16
##  6 1503960366 4/17/2016            9705           6.48             6.48
##  7 1503960366 4/18/2016           13019           8.59             8.59
##  8 1503960366 4/19/2016           15506           9.88             9.88
##  9 1503960366 4/20/2016           10544           6.68             6.68
## 10 1503960366 4/21/2016            9819           6.34             6.34
## # ℹ 930 more rows
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
clean_names(sleepDay)
## # A tibble: 410 × 5
##          id sleep_day total_sleep_records total_minutes_asleep total_time_in_bed
##       <dbl> <chr>                   <dbl>                <dbl>             <dbl>
##  1   1.50e9 4/12/201…                   1                  327               346
##  2   1.50e9 4/13/201…                   2                  384               407
##  3   1.50e9 4/15/201…                   1                  412               442
##  4   1.50e9 4/16/201…                   2                  340               367
##  5   1.50e9 4/17/201…                   1                  700               712
##  6   1.50e9 4/19/201…                   1                  304               320
##  7   1.50e9 4/20/201…                   1                  360               377
##  8   1.50e9 4/21/201…                   1                  325               364
##  9   1.50e9 4/23/201…                   1                  361               384
## 10   1.50e9 4/24/201…                   1                  430               449
## # ℹ 400 more rows
clean_names(dailyCalories)
## # A tibble: 940 × 3
##            id activity_day calories
##         <dbl> <chr>           <dbl>
##  1 1503960366 4/12/2016        1985
##  2 1503960366 4/13/2016        1797
##  3 1503960366 4/14/2016        1776
##  4 1503960366 4/15/2016        1745
##  5 1503960366 4/16/2016        1863
##  6 1503960366 4/17/2016        1728
##  7 1503960366 4/18/2016        1921
##  8 1503960366 4/19/2016        2035
##  9 1503960366 4/20/2016        1786
## 10 1503960366 4/21/2016        1775
## # ℹ 930 more rows

Formatting data

Storing dates as character strings could cause problems, so we convert the date column in each data frame to a proper Date type.

dailyActivity <- dailyActivity %>% 
  mutate(ActivityDate= as_date(ActivityDate, format= "%m/%d/%Y")) %>% 
  rename(date= ActivityDate)
sleepDay <- sleepDay %>% 
  mutate(SleepDay= as_date(SleepDay, format= "%m/%d/%Y %I:%M:%S %p")) %>% 
  rename(date= SleepDay)
dailyCalories <- dailyCalories %>% 
  mutate(ActivityDay= as_date(ActivityDay, format= "%m/%d/%Y")) %>% 
  rename(date= ActivityDay)
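
A format string that does not match the input produces NA rather than an error, so it can be worth checking the parse on a few values first. The sketch below uses the same two format strings as above. The sleep timestamps are an assumption: the preview truncates them, so the full form "12:00:00 AM" is inferred from the format string rather than read from the data.

```r
library(lubridate)

# Sample strings in the two layouts used by the tables
activity_dates <- c("4/12/2016", "4/13/2016")
# Assumed full form of the truncated SleepDay values
sleep_dates <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

parsed_activity <- as_date(activity_dates, format = "%m/%d/%Y")
parsed_sleep    <- as_date(sleep_dates, format = "%m/%d/%Y %I:%M:%S %p")

# Any NA here would mean the format string did not match the input
sum(is.na(parsed_activity))
sum(is.na(parsed_sleep))
```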

Now we check to make sure it is all properly formatted

str(dailyActivity)
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date                    : Date[1:940], format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
str(sleepDay)
## tibble [410 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Id                : num [1:410] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date              : Date[1:410], format: "2016-04-12" "2016-04-13" ...
##  $ TotalSleepRecords : num [1:410] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:410] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:410] 346 407 442 367 712 320 377 364 384 449 ...
str(dailyCalories)
## tibble [940 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Id      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date    : Date[1:940], format: "2016-04-12" "2016-04-13" ...
##  $ Calories: num [1:940] 1985 1797 1776 1745 1863 ...

Analyze

Summary Statistics

dailyActivity %>% 
  select(TotalSteps, TotalDistance, SedentaryMinutes) %>% 
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

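Using the median because of potential outliers, we can see that participants typically walk 7,406 steps a day and cover a distance of about 5.245 km (the distance columns appear to be in kilometers, judging by the stride length the step counts imply). The median sedentary time is 1,057.5 minutes, or about 17.6 hours a day.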
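
The conversion from minutes to hours can be checked with a few lines of arithmetic. This is a small sketch; the helper `to_hours_minutes()` is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: split a number of minutes into whole hours and leftover minutes
to_hours_minutes <- function(total_minutes) {
  c(hours = total_minutes %/% 60, minutes = total_minutes %% 60)
}

# Median sedentary minutes from the summary above
median_sedentary_minutes <- 1057.5

sedentary_hm    <- to_hours_minutes(median_sedentary_minutes)
sedentary_hours <- median_sedentary_minutes / 60
share_of_day    <- median_sedentary_minutes / (24 * 60)

sedentary_hm
round(sedentary_hours, 1)
share_of_day
```

The last value expresses the same figure as a fraction of a 24-hour day, which can be easier to communicate to stakeholders than raw minutes.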

sleepDay %>% 
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>% 
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

Using the median because of potential outliers, we can see that the typical night has 432.5 minutes (7 hours and 12.5 minutes) asleep and 463 minutes in bed. Subtracting time asleep from time in bed gives roughly 30.5 minutes spent awake in bed.
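
The difference between two medians is not necessarily the median of the nightly differences, so an alternative is to compute the awake time for each night first. The sketch below uses only the six rows printed by head(sleepDay) earlier, so its numbers illustrate the method and are not results for the full table. The column name `MinutesAwakeInBed` is introduced here for illustration.

```r
library(dplyr)

# First six rows of sleepDay as printed by head(sleepDay)
sleep_sample <- tibble(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Awake time for each night: time in bed minus time asleep
sleep_sample <- sleep_sample %>%
  mutate(MinutesAwakeInBed = TotalTimeInBed - TotalMinutesAsleep)

# Median of the per-night differences
median_awake <- median(sleep_sample$MinutesAwakeInBed)

# Difference of the two medians, the approach used in the text
diff_of_medians <- median(sleep_sample$TotalTimeInBed) -
  median(sleep_sample$TotalMinutesAsleep)

median_awake
diff_of_medians
```

On these six rows the two approaches give different values, which is why the per-night calculation may be the safer one to report.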

dailyCalories %>% 
  select(Calories) %>% 
  summary()
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900

The median number of calories burned per day is 2,134 (the mean is 2,304).

dailyActivity %>% 
  select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes) %>% 
  summary()
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0       
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0       
##  Median :  4.00    Median :  6.00      Median :199.0       
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8       
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0       
##  Max.   :210.00    Max.   :143.00      Max.   :518.0

Here we can see the median time spent at each intensity: 4 minutes very active, 6 minutes fairly active, and 199 minutes lightly active. The higher the intensity, the fewer minutes participants spend at it.

Merging data

# merge() keeps only rows whose key columns match in both tables (inner join)
calorie_activity <- merge(dailyActivity, dailyCalories, by= c("Id","date","Calories"))
all_data <- merge(calorie_activity, sleepDay, by= c("Id","date"))
head(all_data)
##           Id       date Calories TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12     1985      13162          8.50            8.50
## 2 1503960366 2016-04-13     1797      10735          6.97            6.97
## 3 1503960366 2016-04-15     1745       9762          6.28            6.28
## 4 1503960366 2016-04-16     1863      12669          8.16            8.16
## 5 1503960366 2016-04-17     1728       9705          6.48            6.48
## 6 1503960366 2016-04-19     2035      15506          9.88            9.88
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.14                     1.26
## 4                        0               2.71                     0.41
## 5                        0               3.19                     0.78
## 6                        0               3.53                     1.32
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                2.83                       0                29
## 4                5.04                       0                36
## 5                2.51                       0                38
## 6                5.03                       0                50
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes TotalSleepRecords
## 1                  13                  328              728                 1
## 2                  19                  217              776                 2
## 3                  34                  209              726                 1
## 4                  10                  221              773                 2
## 5                  20                  164              539                 1
## 6                  31                  264              775                 1
##   TotalMinutesAsleep TotalTimeInBed
## 1                327            346
## 2                384            407
## 3                412            442
## 4                340            367
## 5                700            712
## 6                304            320
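
Because merge() performs an inner join by default, days without a sleep record are dropped from all_data (for example, 2016-04-14 is absent from the preview above). The sketch below shows the difference between the default and `all.x = TRUE` on toy data. The data frames `activity_toy` and `sleep_toy` are hypothetical, and the third row's values are invented for illustration.

```r
# Toy activity table: three participant-days (third row is hypothetical)
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 5000)
)

# Toy sleep table: only one of those days has a sleep record
sleep_toy <- data.frame(
  Id = 1,
  date = as.Date("2016-04-12"),
  TotalMinutesAsleep = 327
)

# Default inner join: keeps only days present in both tables
inner <- merge(activity_toy, sleep_toy, by = c("Id", "date"))

# Left join: keeps every activity day, with NA where sleep is missing
left <- merge(activity_toy, sleep_toy, by = c("Id", "date"), all.x = TRUE)

nrow(inner)
nrow(left)
sum(is.na(left$TotalMinutesAsleep))
```

Which join is appropriate depends on the question: the inner join suits comparisons between sleep and activity, while the left join preserves all activity days.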

Share

ggplot(data = all_data, aes(x= TotalSteps, y=SedentaryMinutes)) + geom_point() +
  geom_smooth()+
  labs(title = "Sedentary Minutes vs Total Steps")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This scatter plot shows that most of the group spends a similar amount of time sedentary, while total steps vary over a much wider range.
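
A correlation coefficient is one way to put a number on the relationship shown in the plot. The sketch below uses only the six rows printed by head(all_data), so the value it produces is an illustration of the call and not a finding. On the real data, the equivalent call would be `cor(all_data$TotalSteps, all_data$SedentaryMinutes)`.

```r
# TotalSteps and SedentaryMinutes from the six rows shown by head(all_data)
steps     <- c(13162, 10735, 9762, 12669, 9705, 15506)
sedentary <- c(728, 776, 726, 773, 539, 775)

# Pearson correlation between steps and sedentary minutes
steps_sedentary_cor <- cor(steps, sedentary)
steps_sedentary_cor
```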

ggplot(data = all_data, aes(x=TotalMinutesAsleep, y= TotalTimeInBed)) + geom_point()+
  labs(title = "Time in Bed vs Time Asleep")

This plot shows that most individuals spend more time in bed than they do asleep.

average <- data.frame(
  ActivityLevel = c("High", "Fair","Light"),
  AverageMinutes = c(4,6,199))

ggplot(data = average, aes(x = ActivityLevel, y = AverageMinutes, color = ActivityLevel, fill = ActivityLevel)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none")

Using the medians from the summary data, we can create a bar graph comparing typical minutes at each activity level. It shows that most active time is light activity, which could simply be the result of everyday movement.
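
The values in the bar graph were typed in by hand from the summary output. As an alternative, the same kind of table can be computed directly from the data. The sketch below uses only the first six values of each column shown by str(dailyActivity), so its medians differ from the full-data medians of 4, 6, and 199. The names `activity_sample`, `median_minutes`, and `MedianMinutes` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# First six values of each column as shown by str(dailyActivity)
activity_sample <- tibble(
  VeryActiveMinutes    = c(25, 21, 30, 29, 36, 38),
  FairlyActiveMinutes  = c(13, 19, 11, 34, 10, 20),
  LightlyActiveMinutes = c(328, 217, 181, 209, 221, 164)
)

# Median of every column, reshaped to one row per activity level
median_minutes <- activity_sample %>%
  summarise(across(everything(), median)) %>%
  pivot_longer(everything(),
               names_to = "ActivityLevel",
               values_to = "MedianMinutes")

median_minutes
```

A table in this shape can be passed straight to ggplot(), which avoids copying numbers by hand if the data changes.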

Act

Findings

In conclusion, I found that most participants spent the majority of their day sedentary. They sleep about 7 hours a night and spend roughly 30 minutes awake in bed. The time they spend being active is limited: light activity accounts for by far the most minutes, followed by fairly active time, with very active time the shortest.

Recommendations

Based on this data, I recommend using the Bellabeat app to encourage users to spend more of the day being active. Alerts and goal reminders sent to smart devices such as the Time or Leaf could motivate users to stay active. Improving sleep is another opportunity for Bellabeat's marketing: for example, a dim-light option in the app could help users fall asleep faster with fewer distractions.