Bellabeat Case Study with R

About the company

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.

Questions for analysis

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Installing and loading packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(skimr)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Importing .csv dataset files

In this case study, the datasets come from FitBit Fitness Tracker Data.

activity <- read_csv("dailyActivity_merged.csv")
calories <- read_csv("dailyCalories_merged.csv")
heartrate <- read_csv("heartrate_seconds_merged.csv")
intensities <- read_csv("dailyIntensities_merged.csv")
sleep <- read_csv("sleepDay_merged.csv")
steps <- read_csv("dailySteps_merged.csv")
steps_hourly <- read_csv("hourlySteps_merged.csv")
weight <- read_csv("weightLogInfo_merged.csv")

Checking datasets

After importing the relevant datasets into our working environment, let’s look at the basic structure of the datasets and count the missing values in each column.

head(activity)

## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## #   ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## #   ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance

colSums(is.na(activity))

##                       Id             ActivityDate               TotalSteps 
##                        0                        0                        0 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                        0                        0                        0 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                        0                        0                        0 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        0                        0                        0 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                        0                        0                        0

head(calories)

## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728

colSums(is.na(calories))

##          Id ActivityDay    Calories 
##           0           0           0

head(intensities)

## # A tibble: 6 × 10
##       Id Activ…¹ Seden…² Light…³ Fairl…⁴ VeryA…⁵ Seden…⁶ Light…⁷ Moder…⁸ VeryA…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…     728     328      13      25       0    6.06   0.550    1.88
## 2 1.50e9 4/13/2…     776     217      19      21       0    4.71   0.690    1.57
## 3 1.50e9 4/14/2…    1218     181      11      30       0    3.91   0.400    2.44
## 4 1.50e9 4/15/2…     726     209      34      29       0    2.83   1.26     2.14
## 5 1.50e9 4/16/2…     773     221      10      36       0    5.04   0.410    2.71
## 6 1.50e9 4/17/2…     539     164      20      38       0    2.51   0.780    3.19
## # … with abbreviated variable names ¹ActivityDay, ²SedentaryMinutes,
## #   ³LightlyActiveMinutes, ⁴FairlyActiveMinutes, ⁵VeryActiveMinutes,
## #   ⁶SedentaryActiveDistance, ⁷LightActiveDistance, ⁸ModeratelyActiveDistance,
## #   ⁹VeryActiveDistance

colSums(is.na(intensities))

##                       Id              ActivityDay         SedentaryMinutes 
##                        0                        0                        0 
##     LightlyActiveMinutes      FairlyActiveMinutes        VeryActiveMinutes 
##                        0                        0                        0 
##  SedentaryActiveDistance      LightActiveDistance ModeratelyActiveDistance 
##                        0                        0                        0 
##       VeryActiveDistance 
##                        0

head(steps)

## # A tibble: 6 × 3
##           Id ActivityDay StepTotal
##        <dbl> <chr>           <dbl>
## 1 1503960366 4/12/2016       13162
## 2 1503960366 4/13/2016       10735
## 3 1503960366 4/14/2016       10460
## 4 1503960366 4/15/2016        9762
## 5 1503960366 4/16/2016       12669
## 6 1503960366 4/17/2016        9705

colSums(is.na(steps))

##          Id ActivityDay   StepTotal 
##           0           0           0

head(weight)

## # A tibble: 6 × 8
##           Id Date                  WeightKg Weight…¹   Fat   BMI IsMan…²   LogId
##        <dbl> <chr>                    <dbl>    <dbl> <dbl> <dbl> <lgl>     <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM      52.6     116.    22  22.6 TRUE    1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM      52.6     116.    NA  22.6 TRUE    1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM     134.      294.    NA  47.5 FALSE   1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.    NA  21.5 TRUE    1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.    NA  21.7 TRUE    1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     160.    25  27.5 TRUE    1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport

colSums(is.na(weight))

##             Id           Date       WeightKg   WeightPounds            Fat 
##              0              0              0              0             65 
##            BMI IsManualReport          LogId 
##              0              0              0

The Fat column is missing in 65 rows, which is the majority of the weight data, so we will not use it in this project.
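
To put a number on how much of a column is missing, the share of NA values per column can be computed with colMeans(is.na(...)). The following is a minimal sketch on a made-up table: weight_demo, its values, and the 50% cut-off are assumptions for illustration, not part of the FitBit data.

```r
# Toy stand-in for the weight table; all values are invented for illustration.
weight_demo <- data.frame(
  Id  = c(1, 1, 2, 3),
  Fat = c(22, NA, NA, NA),
  BMI = c(22.6, 22.6, 47.5, 21.5)
)

# Proportion of missing values in each column (between 0 and 1).
missing_share <- colMeans(is.na(weight_demo))
print(missing_share)

# Keep only the columns where at most half of the values are missing.
# The 0.5 threshold is an arbitrary choice for this sketch.
weight_kept <- weight_demo[, missing_share <= 0.5, drop = FALSE]
print(names(weight_kept))
```

A proportion is easier to compare across tables of different sizes than the raw counts from colSums().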

Checking for duplicates

sum(duplicated(activity))

## [1] 0

sum(duplicated(calories))

## [1] 0

sum(duplicated(heartrate))

## [1] 0

sum(duplicated(intensities))

## [1] 0

sum(duplicated(sleep))

## [1] 3

sum(duplicated(steps))

## [1] 0

sum(duplicated(steps_hourly))

## [1] 0

sum(duplicated(weight))

## [1] 0

Removing duplicates

Since the sleep dataset contains 3 duplicate rows, we have to remove them.

sleep <- sleep %>%
  distinct()
# recheck for duplicates
sum(duplicated(sleep))

## [1] 0

Rechecking after running the code confirms that the duplicates were removed.
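
Before dropping duplicates it can help to look at the rows that will be removed. A small sketch of that check, using a made-up table (sleep_demo and its values are hypothetical):

```r
library(dplyr)

# Toy sleep table; the second row repeats the first one exactly.
sleep_demo <- tibble(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 384, 400)
)

# duplicated() flags every row that repeats an earlier row in full.
dupes <- sleep_demo[duplicated(sleep_demo), ]
print(dupes)

# distinct() keeps the first occurrence of each unique row.
sleep_clean <- distinct(sleep_demo)
print(nrow(sleep_clean))
```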

Checking sample size

n_distinct(activity$Id)

## [1] 33

n_distinct(calories$Id)

## [1] 33

n_distinct(heartrate$Id)

## [1] 14

n_distinct(intensities$Id)

## [1] 33

n_distinct(sleep$Id)

## [1] 24

n_distinct(steps$Id)

## [1] 33

n_distinct(steps_hourly$Id)

## [1] 33

n_distinct(weight$Id)

## [1] 8

These counts show how many distinct users each dataset covers. The heartrate dataset has only 14 users and the weight dataset only 8, so we will not use either of them in this project.
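
The same check can be written once for all tables by putting them in a named list. This is only a sketch: the tables list below holds made-up stand-ins, and in the real analysis it would contain the imported data frames.

```r
library(dplyr)

# Toy stand-ins for the imported tables; only the Id column matters here.
tables <- list(
  activity = tibble(Id = c(1, 1, 2, 3)),
  sleep    = tibble(Id = c(1, 2, 2)),
  weight   = tibble(Id = c(3, 3))
)

# Count distinct users in every table and return a named vector.
user_counts <- sapply(tables, function(df) n_distinct(df$Id))
print(user_counts)
```

Having the counts side by side in one vector makes it easier to spot tables with too few users.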

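Cleaning column names

We will convert the column names to lowercase across all tables. The clean_names() calls below only preview the snake_case names, since their results are not assigned. The rename_with(..., tolower) calls perform the actual renaming, so later code uses names such as activitydate and totalsteps.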

clean_names(activity)

## # A tibble: 940 × 15
##            id activity…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸
##         <dbl> <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1503960366 4/12/2016    13162    8.5     8.5        0    1.88   0.550    6.06
##  2 1503960366 4/13/2016    10735    6.97    6.97       0    1.57   0.690    4.71
##  3 1503960366 4/14/2016    10460    6.74    6.74       0    2.44   0.400    3.91
##  4 1503960366 4/15/2016     9762    6.28    6.28       0    2.14   1.26     2.83
##  5 1503960366 4/16/2016    12669    8.16    8.16       0    2.71   0.410    5.04
##  6 1503960366 4/17/2016     9705    6.48    6.48       0    3.19   0.780    2.51
##  7 1503960366 4/18/2016    13019    8.59    8.59       0    3.25   0.640    4.71
##  8 1503960366 4/19/2016    15506    9.88    9.88       0    3.53   1.32     5.03
##  9 1503960366 4/20/2016    10544    6.68    6.68       0    1.96   0.480    4.24
## 10 1503960366 4/21/2016     9819    6.34    6.34       0    1.34   0.350    4.65
## # … with 930 more rows, 6 more variables: sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## #   abbreviated variable names ¹activity_date, ²total_steps, ³total_distance,
## #   ⁴tracker_distance, ⁵logged_activities_distance, ⁶very_active_distance,
## #   ⁷moderately_active_distance, ⁸light_active_distance

activity <- rename_with(activity, tolower)
clean_names(calories)

## # A tibble: 940 × 3
##            id activity_day calories
##         <dbl> <chr>           <dbl>
##  1 1503960366 4/12/2016        1985
##  2 1503960366 4/13/2016        1797
##  3 1503960366 4/14/2016        1776
##  4 1503960366 4/15/2016        1745
##  5 1503960366 4/16/2016        1863
##  6 1503960366 4/17/2016        1728
##  7 1503960366 4/18/2016        1921
##  8 1503960366 4/19/2016        2035
##  9 1503960366 4/20/2016        1786
## 10 1503960366 4/21/2016        1775
## # … with 930 more rows

calories <- rename_with(calories, tolower)
clean_names(intensities)

## # A tibble: 940 × 10
##            id activity…¹ seden…² light…³ fairl…⁴ very_…⁵ seden…⁶ light…⁷ moder…⁸
##         <dbl> <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1503960366 4/12/2016      728     328      13      25       0    6.06   0.550
##  2 1503960366 4/13/2016      776     217      19      21       0    4.71   0.690
##  3 1503960366 4/14/2016     1218     181      11      30       0    3.91   0.400
##  4 1503960366 4/15/2016      726     209      34      29       0    2.83   1.26 
##  5 1503960366 4/16/2016      773     221      10      36       0    5.04   0.410
##  6 1503960366 4/17/2016      539     164      20      38       0    2.51   0.780
##  7 1503960366 4/18/2016     1149     233      16      42       0    4.71   0.640
##  8 1503960366 4/19/2016      775     264      31      50       0    5.03   1.32 
##  9 1503960366 4/20/2016      818     205      12      28       0    4.24   0.480
## 10 1503960366 4/21/2016      838     211       8      19       0    4.65   0.350
## # … with 930 more rows, 1 more variable: very_active_distance <dbl>, and
## #   abbreviated variable names ¹activity_day, ²sedentary_minutes,
## #   ³lightly_active_minutes, ⁴fairly_active_minutes, ⁵very_active_minutes,
## #   ⁶sedentary_active_distance, ⁷light_active_distance,
## #   ⁸moderately_active_distance

intensities <- rename_with(intensities, tolower)
clean_names(sleep)

## # A tibble: 410 × 5
##            id sleep_day             total_sleep_records total_minutes_…¹ total…²
##         <dbl> <chr>                               <dbl>            <dbl>   <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                   1              327     346
##  2 1503960366 4/13/2016 12:00:00 AM                   2              384     407
##  3 1503960366 4/15/2016 12:00:00 AM                   1              412     442
##  4 1503960366 4/16/2016 12:00:00 AM                   2              340     367
##  5 1503960366 4/17/2016 12:00:00 AM                   1              700     712
##  6 1503960366 4/19/2016 12:00:00 AM                   1              304     320
##  7 1503960366 4/20/2016 12:00:00 AM                   1              360     377
##  8 1503960366 4/21/2016 12:00:00 AM                   1              325     364
##  9 1503960366 4/23/2016 12:00:00 AM                   1              361     384
## 10 1503960366 4/24/2016 12:00:00 AM                   1              430     449
## # … with 400 more rows, and abbreviated variable names ¹total_minutes_asleep,
## #   ²total_time_in_bed

sleep <- rename_with(sleep, tolower)
clean_names(steps)

## # A tibble: 940 × 3
##            id activity_day step_total
##         <dbl> <chr>             <dbl>
##  1 1503960366 4/12/2016         13162
##  2 1503960366 4/13/2016         10735
##  3 1503960366 4/14/2016         10460
##  4 1503960366 4/15/2016          9762
##  5 1503960366 4/16/2016         12669
##  6 1503960366 4/17/2016          9705
##  7 1503960366 4/18/2016         13019
##  8 1503960366 4/19/2016         15506
##  9 1503960366 4/20/2016         10544
## 10 1503960366 4/21/2016          9819
## # … with 930 more rows

steps <- rename_with(steps, tolower)
clean_names(steps_hourly)

## # A tibble: 22,099 × 3
##            id activity_hour         step_total
##         <dbl> <chr>                      <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM        373
##  2 1503960366 4/12/2016 1:00:00 AM         160
##  3 1503960366 4/12/2016 2:00:00 AM         151
##  4 1503960366 4/12/2016 3:00:00 AM           0
##  5 1503960366 4/12/2016 4:00:00 AM           0
##  6 1503960366 4/12/2016 5:00:00 AM           0
##  7 1503960366 4/12/2016 6:00:00 AM           0
##  8 1503960366 4/12/2016 7:00:00 AM           0
##  9 1503960366 4/12/2016 8:00:00 AM         250
## 10 1503960366 4/12/2016 9:00:00 AM        1864
## # … with 22,089 more rows

steps_hourly <- rename_with(steps_hourly, tolower)
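
clean_names() returns a new data frame rather than changing its input, so its result has to be assigned to take effect. The sketch below compares the two naming styles on a made-up one-row table (demo is hypothetical):

```r
library(janitor)

# Toy table with the same naming style as the FitBit files.
demo <- data.frame(Id = 1, ActivityDate = "4/12/2016", TotalSteps = 13162)

# snake_case: words are split at the capital letters and joined with "_".
snake <- clean_names(demo)
print(names(snake))

# Lowercase only: the words run together, as in the rest of this analysis.
lower <- setNames(demo, tolower(names(demo)))
print(names(lower))
```

Switching the analysis to the snake_case version would mean updating every later column reference, for example activitydate to activity_date.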

Formatting date and time

In the Checking datasets step, I noticed that all date columns were stored as chr (character) values, so they need to be converted to proper date types.

# activity
activity <- activity %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# calories
calories <- calories %>%
  rename(date = activityday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# intensities
intensities <- intensities %>%
  rename(date = activityday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# sleep
sleep <- sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# steps
steps <- steps %>%
  rename(date = activityday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# steps_hourly
steps_hourly <- steps_hourly %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = mdy_hms(date_time))

Since I will be merging some of the datasets, I want the date columns to be named and typed consistently to avoid errors later.
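
A quick way to confirm that a conversion worked is to parse a few sample strings and inspect the result. The sketch below uses lubridate's mdy() and mdy_hms() on made-up strings in the same formats as the files:

```r
library(lubridate)

# Sample strings in the two formats that appear in the datasets.
day_chr  <- c("4/12/2016", "4/13/2016")
hour_chr <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# Month/day/year text becomes a Date.
day_parsed <- mdy(day_chr)
print(class(day_parsed))

# Text with a 12-hour clock and an AM/PM marker becomes a date-time (UTC by default).
hour_parsed <- mdy_hms(hour_chr)
print(hour(hour_parsed))
```

Checking the class and the extracted hour catches silent failures, because a value that cannot be parsed becomes NA.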

Separating date and time

steps_hourly <- steps_hourly %>%
  separate(date_time, into = c("date", "time"), sep= " ") %>%
  mutate(date = ymd(date))

Separating date and time makes the analysis easier, for example when grouping by date or by time of day.
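
Splitting on a space relies on the text form of the date-time, which may differ between R versions (a midnight value may print without its time part). As an alternative sketch, both columns can be derived directly from the date-time value. The table hourly_demo and its rows are made up for illustration:

```r
library(dplyr)
library(lubridate)

# Toy hourly table with one midnight row and one afternoon row.
hourly_demo <- tibble(
  id = c(1, 1),
  date_time = mdy_hms(c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")),
  steptotal = c(373, 160)
)

hourly_split <- hourly_demo %>%
  mutate(
    date = as_date(date_time),            # calendar date only
    time = format(date_time, "%H:%M:%S")  # 24-hour clock time as text
  ) %>%
  select(-date_time)

print(hourly_split)
```

Formatting the time explicitly keeps the midnight rows labelled 00:00:00, so they still form their own group when averaging by time.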

Merging and adding columns in datasets

# merge() keeps only the id/date pairs that appear in both tables
activity_sleep <- merge(activity, sleep, by = c("id", "date"))
calories_intensities <- merge(calories, intensities, by = c("id", "date"))
# total active minutes = lightly + fairly + very active minutes
calories_intensities$totalminutes <- calories_intensities$lightlyactiveminutes +
  calories_intensities$fairlyactiveminutes +
  calories_intensities$veryactiveminutes
calories_steps <- merge(calories, steps, by = c("id", "date"))
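
By default merge() performs an inner join, so rows without a match in the other table are dropped. The sketch below shows the effect on made-up tables (calories_demo and sleep_demo are hypothetical) and compares it with a left join via all.x = TRUE:

```r
# Toy tables: user 1 has two days of calories but only one day of sleep data.
calories_demo <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  calories = c(1985, 1797, 2000)
)
sleep_demo <- data.frame(
  id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  totalminutesasleep = c(327, 400)
)

# Inner join: only id/date pairs present in both tables survive.
inner <- merge(calories_demo, sleep_demo, by = c("id", "date"))
print(nrow(inner))

# Left join: every calories row is kept; missing sleep values become NA.
left <- merge(calories_demo, sleep_demo, by = c("id", "date"), all.x = TRUE)
print(left)
```

Comparing row counts before and after a merge shows how much data the join discards, which matters here because the sleep table covers fewer users than the activity table.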

Visualizations

ggplot(data = calories_intensities, mapping = aes(x = totalminutes, y = calories)) +
  geom_jitter() + geom_smooth(method = lm) + labs(title = "Active Minutes VS Calories")

## `geom_smooth()` using formula 'y ~ x'

This plot shows a positive relationship between total active minutes and calories burned.
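
A plot shows the direction of a relationship. A correlation coefficient and a fitted slope can describe its strength as well. The sketch below uses made-up numbers (demo is hypothetical), so the values it prints say nothing about the FitBit data:

```r
# Toy data with the same column names as calories_intensities.
demo <- data.frame(
  totalminutes = c(100, 150, 200, 250, 300),
  calories     = c(1700, 1850, 2100, 2200, 2500)
)

# Pearson correlation: values near 1 indicate a strong positive linear relationship.
r <- cor(demo$totalminutes, demo$calories)
print(r)

# The same straight line that geom_smooth(method = lm) draws.
fit <- lm(calories ~ totalminutes, data = demo)
slope <- coef(fit)[["totalminutes"]]  # extra calories per additional active minute
print(slope)
```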

ggplot(data = calories_steps, mapping = aes(x = steptotal, y = calories)) +
  geom_point() + geom_smooth(method = lm) + labs(title = "Total Steps VS Calories")

## `geom_smooth()` using formula 'y ~ x'

This helps confirm the first graph: the more active users are, the more calories they burn.

ggplot(data = activity, mapping = aes(x = totalsteps, y = totaldistance)) +
  geom_point() + geom_smooth(method = lm) + labs(title = "Total Steps VS Total Distance")

## `geom_smooth()` using formula 'y ~ x'

This graph serves as a check that the tracker is functioning: more steps correspond to more distance travelled.

ggplot(data = activity_sleep, mapping = aes(x = totalminutesasleep, y = sedentaryminutes )) +
  geom_point() + geom_smooth(method = lm) + labs(title = "Total Minutes Asleep VS Sedentary Minutes")

## `geom_smooth()` using formula 'y ~ x'

The plot shows a negative relationship between the two variables, which suggests that users with more sedentary minutes tend to sleep less. One possible explanation is that they work longer hours, although the data cannot confirm this.

steps_hourly %>%
  group_by(time) %>%
  summarize(avg_steps = mean(steptotal)) %>%
  ggplot() +
  geom_bar(mapping = aes(x = time, y = avg_steps), stat = "identity") +
  labs(title = "Average Steps Hourly") +
  theme(axis.text.x = element_text(angle = 45))

As we can see, users tend to take the most steps around lunchtime and after office hours.
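
The busiest hours can also be read from a sorted table instead of the chart. A minimal sketch on made-up rows (hourly_demo and its values are hypothetical):

```r
library(dplyr)

# Toy hourly table: two readings for each of three hours.
hourly_demo <- tibble(
  time = c("08:00:00", "08:00:00", "12:00:00", "12:00:00", "18:00:00", "18:00:00"),
  steptotal = c(200, 300, 500, 700, 600, 800)
)

# Average steps per hour, sorted from the busiest hour down.
avg_by_hour <- hourly_demo %>%
  group_by(time) %>%
  summarize(avg_steps = mean(steptotal)) %>%
  arrange(desc(avg_steps))

# The two hours with the highest average step count.
top_hours <- head(avg_by_hour, 2)
print(top_hours)
```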

Recommendations

We could recommend that the Bellabeat app offer more exercise programs for users.
Since users burn more calories as they take more steps, the app could send notifications that encourage users to walk or run more.
Most steps are taken around lunchtime and after office hours, so Bellabeat could use these time frames to run a campaign or send notifications that motivate users.
Users who want to improve their sleep could consider reducing their sedentary time.

This is my first R case study project. Thank you!