Bellabeat is a manufacturer of health-focused smart products for women. The company’s app and multiple smart devices collect data on activity, sleep, stress, hydration levels, and reproductive health to empower women with an understanding of their own health and habits. The company was founded in 2013 by Urška Sršen and Sando Mur and has expanded quickly since, now with the possibility to become a greater player in the global smart device market. Bellabeat membership is a subscription-based membership program that provides users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty, and mindfulness based on their lifestyle and goals.
The phases of the data analysis process (Ask, Prepare, Process, Analyze, Share and Act) are to be used in the analysis of the datasets.
Guiding Questions:
What is the problem you are trying to solve?
What are the trends in smart device usage and which of these are Bellabeat currently implementing.
How can your insights drive business decisions?
The identified trends should guide the company’s marketing strategy.
Business Task:
Stakeholders:
Information on the Data Source:
Limitations of the Dataset:
Is Data ROCCC:
The data integrity and credibility is not sufficient to provide a comprehensive analysis of the company. Therefore, the following analysis can only provide high-level direction and should be verified with
The analysis is to be carried out using RStudio. First, all relevant packages are loaded.
library("tidyverse")
library("here")
library("janitor")
library("skimr")
library("dplyr")
library("lubridate")
library("tidyr")
library("ggplot2")
As there is only a small sample size of the datasets, it is prudent focus on the “daily” data which is more likely to provide high level insights.
daily_activity <- read_csv("dailyActivity_merged.csv")
daily_calories <- read_csv("dailyCalories_merged.csv")
daily_intensities <- read_csv("dailyIntensities_merged.csv")
daily_steps <- read_csv("dailySteps_merged.csv")
daily_sleep <- read_csv("sleepDay_merged.csv")
Review the data to look for useful starting points and patterns.
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
glimpse(daily_calories)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775~
glimpse(daily_intensities)
## Rows: 940
## Columns: 10
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
glimpse(daily_steps)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ StepTotal <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054~
glimpse(daily_sleep)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~
Check to see how many distinct ID’s (users) are in each data set.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_calories$Id)
## [1] 33
n_distinct(daily_intensities$Id)
## [1] 33
n_distinct(daily_steps$Id)
## [1] 33
n_distinct(daily_sleep$Id)
## [1] 24
The sleep data has 9 fewer users.All other data sets have 33 users.
Following a closer inspection, the daily_activity data set contains the relevant data found in daily_calories, daily_intensities and daily_steps. Therefore, the focus can be dedicated to the daily_activity and daily_sleep data sets.
Check for (and remove) duplicates. The sleep data had three identical rows which were removed.
distinct(daily_activity)
## # A tibble: 940 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## 7 1.50e9 4/18/2016 13019 8.59 8.59 0
## 8 1.50e9 4/19/2016 15506 9.88 9.88 0
## 9 1.50e9 4/20/2016 10544 6.68 6.68 0
## 10 1.50e9 4/21/2016 9819 6.34 6.34 0
## # ... with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
daily_sleep_updated <- distinct(daily_sleep)
Next, the date columns are to be converted to the correct “date” format.
daily_activity$ActivityDate <-as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")
The sleep data has time included in the date column. Prior to formatting the column, it is to be split so that the date is on its own.The time data is not necessary and so is removed. Two additional columns were added with the minutes asleep and minutes in bed converted to hours.
sleep_day_2 <- separate(daily_sleep_updated, col = SleepDay, into = c("Date", "Time", "Am/Pm"), sep = " ")
sleep_day_2$Date <-as.Date(sleep_day_2$Date, format = "%m/%d/%Y")
sleep_day_2$`Am/Pm` <- NULL
sleep_day_2$Time <- NULL
sleep_day_2$TotalHoursAsleep <- sleep_day_2$TotalMinutesAsleep/60
sleep_day_2$TotalHoursInBed <- sleep_day_2$TotalTimeInBed/60
Check through each table to see if they contain potentially invalid ID numbers (i.e. less than 10 digits)
subset(daily_activity, nchar(daily_activity$Id) < 10 | nchar(daily_activity$Id) > 10)
## # A tibble: 0 x 15
## # ... with 15 variables: Id <dbl>, ActivityDate <date>, TotalSteps <dbl>,
## # TotalDistance <dbl>, TrackerDistance <dbl>, LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
subset(daily_sleep_updated, nchar(daily_sleep_updated$Id) < 10 | nchar(daily_sleep_updated$Id) > 10)
## # A tibble: 0 x 5
## # ... with 5 variables: Id <dbl>, SleepDay <chr>, TotalSleepRecords <dbl>,
## # TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>
No invalid ID numbers.
Check for rows with NA or missing info.
drop_na(daily_activity)
## # A tibble: 940 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 13162 8.5 8.5 0
## 2 1.50e9 2016-04-13 10735 6.97 6.97 0
## 3 1.50e9 2016-04-14 10460 6.74 6.74 0
## 4 1.50e9 2016-04-15 9762 6.28 6.28 0
## 5 1.50e9 2016-04-16 12669 8.16 8.16 0
## 6 1.50e9 2016-04-17 9705 6.48 6.48 0
## 7 1.50e9 2016-04-18 13019 8.59 8.59 0
## 8 1.50e9 2016-04-19 15506 9.88 9.88 0
## 9 1.50e9 2016-04-20 10544 6.68 6.68 0
## 10 1.50e9 2016-04-21 9819 6.34 6.34 0
## # ... with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
drop_na(daily_sleep_updated)
## # A tibble: 410 x 5
## Id SleepDay TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:~ 1 327 346
## 2 1503960366 4/13/2016 12:00:~ 2 384 407
## 3 1503960366 4/15/2016 12:00:~ 1 412 442
## 4 1503960366 4/16/2016 12:00:~ 2 340 367
## 5 1503960366 4/17/2016 12:00:~ 1 700 712
## 6 1503960366 4/19/2016 12:00:~ 1 304 320
## 7 1503960366 4/20/2016 12:00:~ 1 360 377
## 8 1503960366 4/21/2016 12:00:~ 1 325 364
## 9 1503960366 4/23/2016 12:00:~ 1 361 384
## 10 1503960366 4/24/2016 12:00:~ 1 430 449
## # ... with 400 more rows
Some of the steps and distance results may be 0, which would likely indicate the device was not used that day. As this may skew results, these values were removed.
daily_activity <- daily_activity %>% filter(TotalSteps !=0)
daily_activity <- daily_activity %>% filter(TotalDistance !=0)
Add a day of the week column to the activity data, so that potential patterns can be evaluated.
daily_activity$Day <- weekdays(daily_activity$ActivityDate)
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories" "Day"
A summary of several variables related to the activity and sleep data is shown below.
daily_activity %>%
select(TotalSteps, TotalDistance, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, Calories) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes LightlyActiveMinutes
## Min. : 8 Min. : 0.010 Min. : 0.0 Min. : 0.0
## 1st Qu.: 4927 1st Qu.: 3.373 1st Qu.: 721.2 1st Qu.:147.0
## Median : 8054 Median : 5.590 Median :1020.5 Median :208.5
## Mean : 8329 Mean : 5.986 Mean : 955.2 Mean :210.3
## 3rd Qu.:11096 3rd Qu.: 7.905 3rd Qu.:1189.0 3rd Qu.:272.0
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :518.0
## FairlyActiveMinutes VeryActiveMinutes Calories
## Min. : 0.00 Min. : 0.00 Min. : 52
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:1857
## Median : 8.00 Median : 7.00 Median :2220
## Mean : 14.79 Mean : 23.04 Mean :2362
## 3rd Qu.: 21.00 3rd Qu.: 35.00 3rd Qu.:2832
## Max. :143.00 Max. :210.00 Max. :4900
sleep_day_2 %>%
select(TotalHoursAsleep,
TotalHoursInBed) %>%
summary()
## TotalHoursAsleep TotalHoursInBed
## Min. : 0.9667 Min. : 1.017
## 1st Qu.: 6.0167 1st Qu.: 6.729
## Median : 7.2083 Median : 7.717
## Mean : 6.9862 Mean : 7.641
## 3rd Qu.: 8.1667 3rd Qu.: 8.767
## Max. :13.2667 Max. :16.017
Observations:
Several insights and suggestions, based on the analysis, is summarized below for the Shareholders:
Based on the limitations of the FitBit data, here are some recommendations for future analysis: