Bellabeat - How Can a Wellness Technology Company Play It Smart?

## Introduction and background

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women. By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.

Key Stakeholders:

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician, Bellabeat’s cofounder and key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts guiding Bellabeat’s marketing strategy

Using the Case Study Roadmap as a guide, this analysis will follow the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

Step 1: Ask

Analyze the smart device usage data to gain insight into how consumers use non-Bellabeat smart devices, then select one Bellabeat product to apply these insights to in my presentation.

Guiding questions

What is the problem you are trying to solve?
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices.
How can your insights drive business decisions?
Using this information, I will produce high-level recommendations on how these trends can inform Bellabeat’s marketing strategy.

Business Task: Analyze FitBit fitness tracker data to gain insights into how consumers are using the FitBit app and discover trends that can inform Bellabeat’s marketing strategy.

Business questions:

1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

Produce a report with the following deliverables:

1. A clear summary of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top high-level content recommendations based on your analysis

Step 2: Prepare

The co-founder and Chief Creative Officer encourages me to use “public data” that explores smart device users’ daily habits and points me to a specific Kaggle data set. I then prepare the data for analysis using the “Case Study Roadmap” as a guide:

Guiding questions:

Where is your data stored?
Data is publicly available on Kaggle: FitBit Fitness Tracker Data.
How is the data organized?
Data is stored in 18 CSV files.
Are there issues with bias or credibility in this data? Does your data ROCCC?
This Kaggle data set contains personal fitness tracker data from 30 Fitbit users. The data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 April 2016 and 12 May 2016.
How are you addressing licensing, privacy, security, and accessibility?
30 eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. A good data source is ROCCC which stands for Reliable, Original, Comprehensive, Current, and Cited:

Reliable: LOW. Not reliable, as it only has 30 respondents.
Original: LOW. Third-party provider (Amazon Mechanical Turk).
Comprehensive: MED. Parameters match most of the parameters of Bellabeat’s products.
Current: LOW. Data is 5 years old and may not be relevant.
Cited: LOW. Data collected from a third party, hence unknown.

Overall, this dataset is considered “bad quality data”, and it is not recommended to base business recommendations on it.

How did you verify the data’s integrity?
As data is collected in a survey, we are unable to ascertain its integrity or accuracy.
How does it help you answer your question?
This data explores smart device users’ daily habits. It includes information about daily activity, steps, sleep habits and heart rate that can be used to explore users’ habits and find trends.
Are there any problems with the data?
Data was collected 5 years ago in 2016. Users’ daily activity, fitness and sleeping habits, diet and food consumption may have changed since then. Data may not be timely or relevant. Sample size of 30 FitBit users is not representative of the entire fitness population.

Step 3: Process

Process the data by cleaning it and ensuring that it is correct, relevant, complete and free of errors and outliers.

Upload CSV files to R

I upload the CSV files to my project from the relevant data source: https://www.kaggle.com/arashnic/fitbit

There are many different CSV files in this dataset, but I decided to concentrate on two CSVs: “dailyActivity_merged.csv” and “sleepDay_merged.csv”.

Installing and loading common packages and libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readr)
library(dplyr)
library(ggplot2)
library(knitr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(rmarkdown)
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr)

# Import dataset "dailyActivity_merged.csv"
daily_activity <- read_csv("C:\\Users\\MM\\OneDrive\\Documentos\\Fitabase Data 4.12.16-5.12.16\\dailyActivity_merged.csv")

## Rows: 940 Columns: 15

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Import dataset "sleepDay_merged.csv"
daily_sleep <- read_csv("C:\\Users\\MM\\OneDrive\\Documentos\\Fitabase Data 4.12.16-5.12.16\\sleepDay_merged.csv")

## Rows: 413 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Explore and preview the first 10 rows of data
head(daily_activity, 10)

## # A tibble: 10 x 15
##            Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##         <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
##  1 1503960366 4/12/2016         13162          8.5             8.5                 0
##  2 1503960366 4/13/2016         10735          6.97            6.97                0
##  3 1503960366 4/14/2016         10460          6.74            6.74                0
##  4 1503960366 4/15/2016          9762          6.28            6.28                0
##  5 1503960366 4/16/2016         12669          8.16            8.16                0
##  6 1503960366 4/17/2016          9705          6.48            6.48                0
##  7 1503960366 4/18/2016         13019          8.59            8.59                0
##  8 1503960366 4/19/2016         15506          9.88            9.88                0
##  9 1503960366 4/20/2016         10544          6.68            6.68                0
## 10 1503960366 4/21/2016          9819          6.34            6.34                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

head(daily_sleep, 10)

## # A tibble: 10 x 5
##            Id SleepDay              TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
##         <dbl> <chr>                            <dbl>            <dbl>          <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                1              327            346
##  2 1503960366 4/13/2016 12:00:00 AM                2              384            407
##  3 1503960366 4/15/2016 12:00:00 AM                1              412            442
##  4 1503960366 4/16/2016 12:00:00 AM                2              340            367
##  5 1503960366 4/17/2016 12:00:00 AM                1              700            712
##  6 1503960366 4/19/2016 12:00:00 AM                1              304            320
##  7 1503960366 4/20/2016 12:00:00 AM                1              360            377
##  8 1503960366 4/21/2016 12:00:00 AM                1              325            364
##  9 1503960366 4/23/2016 12:00:00 AM                1              361            384
## 10 1503960366 4/24/2016 12:00:00 AM                1              430            449

# Familiarize with the data and column datatypes
str(daily_activity)

## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(daily_sleep)

## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Understanding some summary statistics

How many unique participants are there in each dataframe? It looks like there may be more participants in the daily activity dataset than the sleep dataset.

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(daily_sleep$Id)

## [1] 24
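To see which participants are missing from the sleep table, the two sets of Ids can be compared directly. The following is a minimal sketch on made-up Ids, using only base R; the vectors `activity_ids` and `sleep_ids` are hypothetical stand-ins for `daily_activity$Id` and `daily_sleep$Id`.

```r
# Hypothetical Ids standing in for daily_activity$Id and daily_sleep$Id
activity_ids <- c(101, 101, 102, 103, 104)
sleep_ids    <- c(101, 103, 103)

# Participants who logged activity but never logged sleep
no_sleep_ids <- setdiff(unique(activity_ids), unique(sleep_ids))

# Share of activity participants who also have sleep records
sleep_coverage <- length(unique(sleep_ids)) / length(unique(activity_ids))

print(no_sleep_ids)
print(sleep_coverage)
```

With the counts reported above, 24 of the 33 participants (about 73%) have at least one sleep record.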

# Check for missing values
sum(is.na(daily_activity))

## [1] 0

sum(is.na(daily_sleep))

## [1] 0

#Check for duplicates
sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(daily_sleep))

## [1] 3

# Remove duplicates and NA from table 2
daily_sleep <- daily_sleep %>% 
  distinct() %>% 
  drop_na()

# Check duplicates were removed from tables
sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(daily_sleep))

## [1] 0
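Before dropping duplicates it can be useful to look at the rows involved. The sketch below uses a toy data frame (the values in `sleep` are illustrative, not taken from the full data set) and base R only; `unique()` on a data frame removes fully duplicated rows, which is comparable to what `distinct()` does in the step above.

```r
# Toy sleep table with one fully duplicated row (illustrative values)
sleep <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 384, 384, 400)
)

# Flag every copy of a duplicated row, not only the second occurrence
is_dup <- duplicated(sleep) | duplicated(sleep, fromLast = TRUE)
dup_rows <- sleep[is_dup, ]

# Keep one copy of each row
deduped <- unique(sleep)

print(dup_rows)
print(nrow(deduped))
```

In the real data the same logic takes the sleep table from 413 to 410 rows, since 3 duplicated rows were found.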

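# Preview dataset 1 with snake_case column names (the result is not assigned, so the original column names are kept below)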
clean_names(daily_activity)

## # A tibble: 940 x 15
##            id activity_date total_steps total_distance tracker_distance
##         <dbl> <chr>               <dbl>          <dbl>            <dbl>
##  1 1503960366 4/12/2016           13162           8.5              8.5 
##  2 1503960366 4/13/2016           10735           6.97             6.97
##  3 1503960366 4/14/2016           10460           6.74             6.74
##  4 1503960366 4/15/2016            9762           6.28             6.28
##  5 1503960366 4/16/2016           12669           8.16             8.16
##  6 1503960366 4/17/2016            9705           6.48             6.48
##  7 1503960366 4/18/2016           13019           8.59             8.59
##  8 1503960366 4/19/2016           15506           9.88             9.88
##  9 1503960366 4/20/2016           10544           6.68             6.68
## 10 1503960366 4/21/2016            9819           6.34             6.34
## # ... with 930 more rows, and 10 more variables:
## #   logged_activities_distance <dbl>, very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>

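# Preview dataset 2 with snake_case column names (the result is not assigned, so the original column names are kept below)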
clean_names(daily_sleep)

## # A tibble: 410 x 5
##            id sleep_day             total_sleep_rec~ total_minutes_a~ total_time_in_b~
##         <dbl> <chr>                            <dbl>            <dbl>            <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                1              327              346
##  2 1503960366 4/13/2016 12:00:00 AM                2              384              407
##  3 1503960366 4/15/2016 12:00:00 AM                1              412              442
##  4 1503960366 4/16/2016 12:00:00 AM                2              340              367
##  5 1503960366 4/17/2016 12:00:00 AM                1              700              712
##  6 1503960366 4/19/2016 12:00:00 AM                1              304              320
##  7 1503960366 4/20/2016 12:00:00 AM                1              360              377
##  8 1503960366 4/21/2016 12:00:00 AM                1              325              364
##  9 1503960366 4/23/2016 12:00:00 AM                1              361              384
## 10 1503960366 4/24/2016 12:00:00 AM                1              430              449
## # ... with 400 more rows

# Convert the date column to the Date type (yyyy-mm-dd) and rename it "Date"
daily_activity <- daily_activity %>%
  rename(Date = ActivityDate) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y"))

daily_sleep <- daily_sleep %>%
  rename(Date = SleepDay) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y %I:%M:%S %p"))

# Confirm column date is updated correctly
head(daily_activity)

## # A tibble: 6 x 15
##           Id Date       TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##        <dbl> <date>          <dbl>         <dbl>           <dbl>            <dbl>
## 1 1503960366 2016-04-12      13162          8.5             8.5                 0
## 2 1503960366 2016-04-13      10735          6.97            6.97                0
## 3 1503960366 2016-04-14      10460          6.74            6.74                0
## 4 1503960366 2016-04-15       9762          6.28            6.28                0
## 5 1503960366 2016-04-16      12669          8.16            8.16                0
## 6 1503960366 2016-04-17       9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

head(daily_sleep)

## # A tibble: 6 x 5
##           Id Date       TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <date>                 <dbl>              <dbl>          <dbl>
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
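A wrong format string makes date parsing return NA rather than fail, so an explicit check can complement the visual inspection. The following base R sketch uses strings copied from the previews above; it strips the time part of the sleep timestamps, which is an assumption based on every previewed row showing 12:00:00 AM.

```r
# Sample strings copied from the two raw date columns
activity_raw <- c("4/12/2016", "4/13/2016")
sleep_raw    <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Keep only the date part of the sleep timestamps, then parse month/day/year
activity_date <- as.Date(activity_raw, format = "%m/%d/%Y")
sleep_date    <- as.Date(sub(" .*$", "", sleep_raw), format = "%m/%d/%Y")

# A failed parse yields NA, so this catches a wrong format string
stopifnot(!any(is.na(activity_date)), !any(is.na(sleep_date)))

# Both tables should now share the same key values for the same days
stopifnot(identical(activity_date, sleep_date))
print(activity_date)
```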

Guiding questions:

What tools are you choosing and why?
I am using R for data cleaning, transformation and visualization. R provides an accessible language to organize, modify, and clean data frames and to create insightful data visualizations.
Have you ensured your data’s integrity?
Data was collected in a survey, so I am unable to ascertain its integrity or accuracy.
What steps have you taken to ensure that your data is clean?
I removed duplicates and NA values and converted the date columns to a consistent date format.
How can you verify that your data is clean and ready to analyze?
I re-checked the results of the cleaning operations above: no missing values, no duplicates, and correctly parsed dates.
Have you documented your cleaning process so you can review and share those results?
Yes, I used R Markdown and code chunks.
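The cleanliness checks described in the answers above can also be bundled into one reusable test. This is a sketch under assumptions: the helper `is_clean()` and the toy data frame are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: TRUE when a data frame has no NAs, no duplicate rows,
# and a date column of class Date
is_clean <- function(df, date_col = "Date") {
  !any(is.na(df)) &&
    !any(duplicated(df)) &&
    inherits(df[[date_col]], "Date")
}

# Toy data frame with the same key columns as the cleaned tables
toy <- data.frame(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 9762)
)

print(is_clean(toy))

# Appending a copy of the first row introduces a duplicate
print(is_clean(rbind(toy, toy[1, ])))
```

Applied to the real tables, the call would be `is_clean(daily_activity)` and `is_clean(daily_sleep)`.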

Step 4: Analyze

Now that the data is stored appropriately and has been prepared for analysis we can start putting it to work.

How many observations are there in each dataframe?

nrow(daily_activity)

## [1] 940

nrow(daily_sleep)

## [1] 410

What are some quick summary statistics we’d want to know about each data frame?

daily_activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes) %>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

daily_sleep %>%  
  select(TotalSleepRecords,
  TotalMinutesAsleep,
  TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0
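The means in these summaries are easier to interpret in hours. A short worked conversion, using the mean values printed above:

```r
# Mean values in minutes, copied from the summary() output above
mean_minutes <- c(
  SedentaryMinutes   = 991.2,
  TotalMinutesAsleep = 419.2,
  TotalTimeInBed     = 458.5
)

# Convert to hours, rounded to one decimal place
mean_hours <- round(mean_minutes / 60, 1)
print(mean_hours)
```

On average, participants were sedentary for about 16.5 hours a day, slept about 7 hours, and spent about 7.6 hours in bed.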

Plotting a few explorations on user activities:

What’s the relationship between steps taken in a day and sedentary minutes? How could this help inform the customer segments that we can market to?

We observe a negative relationship between total steps taken and sedentary minutes.
We can also note that sedentary time is not necessarily related to calories burned.
We also see that calories generally trend positively with total steps taken.

ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes, color = Calories)) + geom_point()
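The visual impression from a scatter plot can be backed by a correlation coefficient. The sketch below uses made-up values, not the Fitbit data, and the base R function `cor()`, which computes the Pearson correlation by default.

```r
# Made-up daily values for illustration only
steps     <- c(0, 2500, 5000, 7500, 10000, 15000)
sedentary <- c(1440, 1250, 1100, 900, 780, 700)

# A negative value means more steps go together with fewer sedentary minutes
r <- cor(steps, sedentary)
print(round(r, 2))
```

On the real data the equivalent call would be `cor(daily_activity$TotalSteps, daily_activity$SedentaryMinutes)`.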

What’s the relationship between minutes asleep and time in bed? You might expect it to be almost completely linear - are there any unexpected trends?

Looking at the next graph, we find some outliers. Some users spent a lot of time in bed without actually sleeping, while a small group slept a great deal and spent even more time in bed.

ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point(aes(color=Date))
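One way to quantify the gap between time in bed and time asleep is to compute the minutes spent awake in bed and the share of bed time spent asleep. The sketch uses the first five sleep records previewed earlier; the column names `MinutesAwakeInBed` and `SleepEfficiency` are introduced here for illustration.

```r
# First five sleep records from the preview of daily_sleep
sleep <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700),
  TotalTimeInBed     = c(346, 407, 442, 367, 712)
)

# Minutes in bed that were not spent asleep
sleep$MinutesAwakeInBed <- sleep$TotalTimeInBed - sleep$TotalMinutesAsleep

# Share of time in bed spent asleep (1 = asleep the whole time)
sleep$SleepEfficiency <- sleep$TotalMinutesAsleep / sleep$TotalTimeInBed

print(sleep)
```

Records with a low share would correspond to the outliers far from the diagonal in the plot.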

What could these trends tell you about how to help market this product?

We could encourage consumers to use their watch to better monitor their time in bed against their actual sleep time.

Or areas where you might want to explore further?

On which days of the week do users log the most time? How does this relate to sedentary minutes?

Merging the two datasets together, so that the data is more useful and accessible:

daily_data <- merge(daily_activity, daily_sleep, by=c ("Id", "Date"))
glimpse(daily_data)

## Rows: 410
## Columns: 18
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date                     <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-~
## $ TotalSteps               <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3~
## $ FairlyActiveMinutes      <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,~
## $ LightlyActiveMinutes     <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, ~
## $ SedentaryMinutes         <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, ~
## $ Calories                 <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177~
## $ TotalSleepRecords        <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep       <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, ~
## $ TotalTimeInBed           <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, ~

n_distinct(daily_data$Id)

## [1] 24

We will now execute a full outer join. To keep all rows from both data frames, specify all = TRUE.

combined_data <- merge(daily_activity, daily_sleep, by=c ('Id', 'Date'), all = TRUE)
head(combined_data)

##           Id       Date TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12      13162          8.50            8.50
## 2 1503960366 2016-04-13      10735          6.97            6.97
## 3 1503960366 2016-04-14      10460          6.74            6.74
## 4 1503960366 2016-04-15       9762          6.28            6.28
## 5 1503960366 2016-04-16      12669          8.16            8.16
## 6 1503960366 2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
##   TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1                 1                327            346
## 2                 2                384            407
## 3                NA                 NA             NA
## 4                 1                412            442
## 5                 2                340            367
## 6                 1                700            712

n_distinct(combined_data$Id)

## [1] 33

sum(is.na(combined_data))

## [1] 1590
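The number of missing values follows from the row counts: 940 - 410 = 530 activity days have no sleep record, and each of them lacks 3 sleep columns, giving 1590. The same relationship can be sketched on a toy pair of tables with base R `merge()`; the toy values are illustrative, and the toy sleep table has 2 sleep columns rather than 3.

```r
# Toy activity table: 4 days across 2 participants
activity <- data.frame(
  Id = c(1, 1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14", "2016-04-12")),
  TotalSteps = c(13162, 10735, 10460, 9000)
)

# Toy sleep table: only 2 of those days have a sleep record
sleep <- data.frame(
  Id = c(1, 1),
  Date = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384),
  TotalTimeInBed = c(346, 407)
)

inner <- merge(activity, sleep, by = c("Id", "Date"))              # matching rows only
full  <- merge(activity, sleep, by = c("Id", "Date"), all = TRUE)  # all rows from both

# Unmatched rows times the number of sleep columns gives the NA count
n_missing <- sum(is.na(full))
expected_missing <- (nrow(full) - nrow(inner)) * 2
print(c(n_missing, expected_missing))

# Same arithmetic for the real tables (3 sleep columns)
print((940 - 410) * 3)
```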

# Replace missing values with 0; in the sleep columns a 0 now means "no sleep record for that day"
combined_data <- combined_data %>% 
  mutate_if(is.numeric, ~replace(., is.na(.), 0))

sum(is.na(combined_data))

## [1] 0
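Filling missing sleep values with 0 affects any average computed on those columns afterwards. A small sketch of the effect, using three sleep values from the preview plus one missing day:

```r
# Three recorded nights and one day without a sleep record
sleep_minutes <- c(327, 384, NA, 412)

# Average over recorded nights only
mean_ignoring_missing <- mean(sleep_minutes, na.rm = TRUE)

# Average after replacing the missing day with 0
zero_filled <- replace(sleep_minutes, is.na(sleep_minutes), 0)
mean_zero_filled <- mean(zero_filled)

print(c(mean_ignoring_missing, mean_zero_filled))
```

For sleep statistics it may therefore be safer to filter out rows where TotalSleepRecords is 0, or to work from the inner-joined `daily_data`.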

I will create a new column “DayOfTheWeek” by first generating the weekday as a number (format code %w: 0 = Sunday, 1 = Monday, …, 6 = Saturday), and a new column “TotalMinutes” as the sum of VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes and SedentaryMinutes.
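A minimal sketch of both derived columns on three rows taken from the activity preview. It is not the method used in this report: it reads the weekday number from `as.POSIXlt()$wday` (0 = Sunday) and maps it to English names through a lookup vector, so the result does not depend on the language of the system locale.

```r
# Three activity days copied from the preview (12, 16 and 17 April 2016)
activity <- data.frame(
  Date = as.Date(c("2016-04-12", "2016-04-16", "2016-04-17")),
  VeryActiveMinutes    = c(25, 36, 38),
  FairlyActiveMinutes  = c(13, 10, 20),
  LightlyActiveMinutes = c(328, 221, 164),
  SedentaryMinutes     = c(728, 773, 539)
)

# English day names in the order used by $wday (0 = Sunday ... 6 = Saturday)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
wday_index <- as.POSIXlt(activity$Date)$wday

# Ordered Monday to Sunday for plotting
activity$DayOfTheWeek <- factor(
  day_names[wday_index + 1],
  levels = c(day_names[2:7], day_names[1])
)

# Total tracked minutes per day
activity$TotalMinutes <- with(
  activity,
  VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes + SedentaryMinutes
)

print(activity[, c("Date", "DayOfTheWeek", "TotalMinutes")])
```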

format(as.Date(combined_data$Date),"%w")

##   [1] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
##  [19] "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6"
##  [37] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
##  [55] "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
##  [73] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1"
##  [91] "2" "3" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [109] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4"
## [127] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1"
## [145] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2"
## [163] "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6"
## [181] "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [199] "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [217] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
## [235] "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6"
## [253] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "2" "3" "4" "5" "6"
## [271] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [289] "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [307] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "2" "3" "4" "5" "6" "0" "1" "2"
## [325] "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6"
## [343] "0" "1" "2" "3" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1"
## [361] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2"
## [379] "3" "4" "5" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2"
## [397] "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3"
## [415] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [433] "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1"
## [451] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
## [469] "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6"
## [487] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [505] "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [523] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5"
## [541] "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2"
## [559] "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [577] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [595] "1" "2" "3" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2"
## [613] "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6"
## [631] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [649] "4" "5" "6" "0" "1" "2" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6"
## [667] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "2" "3" "4" "5"
## [685] "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2"
## [703] "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [721] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "2"
## [739] "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6"
## [757] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0"
## [775] "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [793] "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
## [811] "6" "0" "1" "2" "3" "4" "5" "6" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [829] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1"
## [847] "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2"
## [865] "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3"
## [883] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [901] "1" "2" "3" "4" "5" "6" "0" "1" "2" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [919] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [937] "1" "2" "3" "4"

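# weekdays() returns day names in the language of the system locale
combined_data$DayOfTheWeek = weekdays(as.Date(combined_data$Date,format = "%Y-%m-%d"))
# Values that do not match one of the English levels below become NA
combined_data$DayOfTheWeek = factor(combined_data$DayOfTheWeek, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))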
combined_data$TotalMinutes = combined_data$VeryActiveMinutes+combined_data$FairlyActiveMinutes+combined_data$LightlyActiveMinutes+combined_data$SedentaryMinutes
str(combined_data)

## 'data.frame':    940 obs. of  20 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ Date                    : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : num  13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num  728 776 1218 726 773 ...
##  $ Calories                : num  1985 1797 1776 1745 1863 ...
##  $ TotalSleepRecords       : num  1 2 0 1 2 1 0 1 1 1 ...
##  $ TotalMinutesAsleep      : num  327 384 0 412 340 700 0 304 360 325 ...
##  $ TotalTimeInBed          : num  346 407 0 442 367 712 0 320 377 364 ...
##  $ DayOfTheWeek            : Factor w/ 7 levels "Monday","Tuesday",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ TotalMinutes            : num  1094 1033 1440 998 1040 ...

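That only half worked: TotalMinutes was computed, but DayOfTheWeek came out as NA because the day names were generated in Portuguese and did not match the English factor levels. Let's try a different method and change these numbers to actual weekdays.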

combined_data$DayOfTheWeek <- format(as.Date(combined_data$Date),"%w")
wday(combined_data$Date, label=TRUE)

##   [1] ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex
##  [19] sáb dom seg ter qua qui sex sáb dom seg ter qua qui ter qua qui sex sáb
##  [37] dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua
##  [55] qui sex sáb dom seg ter qua qui ter qua qui sex sáb dom seg ter qua qui
##  [73] sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg
##  [91] ter qua ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua
## [109] qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui ter qua qui
## [127] sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg
## [145] ter qua qui sex sáb dom seg ter qua qui ter qua qui sex sáb dom seg ter
## [163] qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb
## [181] dom seg ter qua qui ter qua qui sex sáb dom seg ter qua qui sex sáb dom
## [199] seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui
## [217] ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex
## [235] sáb dom seg ter qua qui sex sáb dom seg ter qua qui ter qua qui sex sáb
## [253] dom seg ter qua qui sex sáb dom seg ter qua qui sex ter qua qui sex sáb
## [271] dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua
## [289] qui sex sáb dom seg ter qua qui ter qua qui sex sáb dom seg ter qua qui
## [307] sex sáb dom seg ter qua qui sex sáb dom ter qua qui sex sáb dom seg ter
## [325] qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb
## [343] dom seg ter qua ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg
## [361] ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui ter
## [379] qua qui sex ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter
## [397] qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui ter qua
## [415] qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom
## [433] seg ter qua qui sex sáb dom seg ter qua qui ter qua qui sex sáb dom seg
## [451] ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex
## [469] sáb dom seg ter qua qui ter qua qui sex sáb dom seg ter qua qui sex sáb
## [487] dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua
## [505] qui ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui
## [523] sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui ter qua qui sex
## [541] sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter
## [559] qua qui sex sáb dom seg ter qua qui ter qua qui sex sáb dom seg ter qua
## [577] qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom
## [595] seg ter qua ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter
## [613] qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb
## [631] dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua
## [649] qui sex sáb dom seg ter ter qua qui sex sáb dom seg ter qua qui sex sáb
## [667] dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb ter qua qui sex
## [685] sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter
## [703] qua qui sex sáb dom seg ter qua qui ter qua qui sex sáb dom seg ter qua
## [721] qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb ter
## [739] qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb
## [757] dom seg ter qua qui sex sáb dom seg ter qua qui ter qua qui sex sáb dom
## [775] seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui
## [793] sex sáb dom seg ter qua qui ter qua qui sex sáb dom seg ter qua qui sex
## [811] sáb dom seg ter qua qui sex sáb ter qua qui sex sáb dom seg ter qua qui
## [829] sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg
## [847] ter qua qui ter qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter
## [865] qua qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui ter qua
## [883] qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom
## [901] seg ter qua qui sex sáb dom seg ter ter qua qui sex sáb dom seg ter qua
## [919] qui sex sáb dom seg ter qua qui sex sáb dom seg ter qua qui sex sáb dom
## [937] seg ter qua qui
## Levels: dom < seg < ter < qua < qui < sex < sáb
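
The day names above come out in Portuguese because `weekdays()`, `wday(label = TRUE)` and `strftime('%A')` all follow the session locale. As a minimal sketch of a locale-independent alternative, assuming only base R, the ISO weekday number from `format(..., "%u")` (1 = Monday, 7 = Sunday) can be mapped to English labels. The `dates` vector and the `day_names` labels are illustrative and not part of the original analysis.

```r
# Illustrative dates; the first two match the first Date values shown by str()
dates <- as.Date(c("2016-04-12", "2016-04-13", "2016-04-17", "2016-04-18"))

# English labels in Monday-to-Sunday order (assumed labels, chosen for this sketch)
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday), independent of locale
iso_day <- as.integer(format(dates, "%u"))

# Explicit levels keep plots in calendar order instead of alphabetical order
day_of_week <- factor(day_names[iso_day], levels = day_names)
print(day_of_week)
```

Applied to `combined_data$Date`, the same mapping would give a factor that orders correctly in a bar chart regardless of the machine's language settings.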

combined_data$DayOfTheWeek = strftime(combined_data$Date,'%A')

combined_data$TotalHours <- round((combined_data$TotalMinutes/60), digits=2)

combined_data %>%  
  select(TotalSteps,
  SedentaryMinutes,
  Calories) %>%
  summary()

##    TotalSteps    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :1440.0   Max.   :4900

Interpreting statistical findings:

On average, users logged 7,638 steps or 5.4km, which is not adequate. As recommended by the WHO (OMS), an adult female should aim for at least 10,000 steps or 8km per day to benefit from general health, weight loss and fitness improvement. Source: Médis article
Sedentary time dominates: users logged on average 991.2 sedentary minutes (about 16.5 hours) per day, making up 81% of total average minutes.
Noting that average calories burned is 2,304 calories, equivalent to 0.3 kg. This could not be interpreted in detail, as calories burned depend on several factors such as age, weight, daily tasks, exercise, hormones and daily calorie intake. Source: A tua Saúde article
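
As a quick check of the sedentary figures, the arithmetic can be sketched from the means printed by `summary()` in this report (mean SedentaryMinutes of 991.2 and mean TotalHours of 20.31). The variable names are illustrative only.

```r
# Means copied from the summary() outputs in this report
mean_sedentary_min <- 991.2   # mean SedentaryMinutes
mean_total_hours   <- 20.31   # mean TotalHours

# Convert minutes to hours, then express as a share of the average hours logged
sedentary_hours <- mean_sedentary_min / 60
sedentary_share <- sedentary_hours / mean_total_hours * 100

cat(sprintf("Sedentary: %.1f hours per day, %.1f%% of logged time\n",
            sedentary_hours, sedentary_share))
```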

#install.packages("ggpubr")
library(ggpubr)
p1= ggplot(data=combined_data)+
      geom_point(mapping = aes(x = TotalDistance, y =Calories),color="yellow")+
      labs(title="Total Distance vs. Calories")
p2= ggplot(data=combined_data)+
      geom_point(mapping = aes(x = TotalSteps, y =Calories),color="green")+
      labs(title="Total Steps vs. Calories")
p3= ggplot(data=combined_data)+
      geom_point(mapping = aes(x = TotalMinutes, y =Calories),color="red")+
      labs(title="Total Minutes vs. Calories")
ggarrange(p1, p2, p3, ncol = 3, nrow = 1)

A quick look at these 3 plots suggests that “TotalDistance” is most closely related to “Calories”, but let's explore a little further.
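
One way to put a number on "more closely related" is the Pearson correlation between each variable and Calories. The sketch below uses only the first five rows visible in the `str()` output, so the coefficients it prints are illustrative and say nothing reliable about the full 940-row dataset; `sample_rows` is a hypothetical stand-in for `combined_data`.

```r
# First five rows as shown by str(combined_data); a tiny illustrative sample
sample_rows <- data.frame(
  TotalSteps    = c(13162, 10735, 10460, 9762, 12669),
  TotalDistance = c(8.5, 6.97, 6.74, 6.28, 8.16),
  TotalMinutes  = c(1094, 1033, 1440, 998, 1040),
  Calories      = c(1985, 1797, 1776, 1745, 1863)
)

# Pearson correlation of each candidate variable with Calories
predictors <- c("TotalDistance", "TotalSteps", "TotalMinutes")
correlations <- sapply(predictors, function(col) {
  cor(sample_rows[[col]], sample_rows$Calories)
})
print(round(correlations, 2))
```

Running the same `sapply()` call on `combined_data` would show which of the three variables has the strongest linear relationship with calories burned.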

Guiding questions:

How should you organize your data to perform analysis on it?
I merged “daily_activity” and “daily_sleep” with a full join, replaced NAs with “0” and confirmed that all 33 users are in the “combined_data” dataset, so the data can be more useful and accessible.
Has your data been properly formatted? Yes, dates were converted to European format to be more accurate, consistent, and easy to read.
What surprises did you discover in the data? We observed a negative relationship between total steps taken and sedentary minutes.
What trends or relationships did you find in the data?
We can also note that sedentary time is not necessarily related to calories burned. We also see that calories generally trend positively with total steps taken.
How will these insights help answer your business questions?
These insights should help show which groups of consumers make poor use of smart devices.
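
The merge described in the first answer can be sketched as follows. The two small data frames are hypothetical stand-ins for “daily_activity” and “daily_sleep”, and the `Id` and `Date` join keys are an assumption about how the original merge was keyed.

```r
# Hypothetical miniature versions of the two source tables
daily_activity <- data.frame(
  Id         = c(1, 1, 2),
  Date       = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 8000)
)
daily_sleep <- data.frame(
  Id                 = c(1, 3),
  Date               = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 400)
)

# Full outer join: all = TRUE keeps rows that appear in either table
combined <- merge(daily_activity, daily_sleep, by = c("Id", "Date"), all = TRUE)

# Replace NA with 0 in numeric columns only, leaving the Date column untouched
num_cols <- sapply(combined, is.numeric)
combined[num_cols] <- lapply(combined[num_cols],
                             function(col) replace(col, is.na(col), 0))

# Confirm how many distinct users survived the merge
n_users <- length(unique(combined$Id))
print(combined)
print(n_users)
```

One caveat with this approach: after the replacement, a day with no sleep record looks the same as a day with zero minutes asleep, so zeros in the sleep columns are best read as "not recorded".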

Step 5: Share

Plot Nr. of times users logged in app across the week

# Correcting weekday order manually
# strftime('%A') produced Portuguese day names (session locale), so the levels use the same names, Monday to Sunday
combined_data$DayOfTheWeek <- factor(combined_data$DayOfTheWeek, levels = c("segunda-feira", "terça-feira", "quarta-feira", "quinta-feira", "sexta-feira", "sábado", "domingo"))

# Manually ordered barchart
ggplot(data=combined_data,aes(x=DayOfTheWeek, fill=DayOfTheWeek)) + geom_bar(stat = "count") +
  theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold")) + 
  labs(x = 'Day of Week',
       y = 'Frequency',
       title = 'Nr. of times users logged in app across the week')

In this bar chart, we are looking at the frequency of FitBit app usage in terms of days of the week.

We discovered that users prefer or remember (giving them the benefit of the doubt that they simply forgot) to track their activity on the app during midweek, starting on Tuesday.
The frequency drops on Friday and stays lower through the weekend and Monday.
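
When reading raw counts per weekday, it may be worth checking how often each weekday occurs in the tracking window itself. The sketch below assumes a hypothetical 31-day window starting on 2016-04-12 (the first date in the `str()` output); the window length is an assumption, not something confirmed by this report.

```r
# Hypothetical tracking window: 31 consecutive days from the first date in the data
window <- seq(as.Date("2016-04-12"), by = "day", length.out = 31)

# Locale-independent English weekday labels via the ISO weekday number
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
day_of_week <- factor(day_names[as.integer(format(window, "%u"))],
                      levels = day_names)

# How many times each weekday occurs in the window
days_in_window <- table(day_of_week)
print(days_in_window)
```

If the real data cover a similar window, dividing each weekday's record count by these occurrence counts would be one way to compare days on an equal footing.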

Plot Calories burned for every step taken:

combined_data %>%  
  select(TotalSteps,
  Calories) %>%
  summary()

##    TotalSteps       Calories   
##  Min.   :    0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.:1828  
##  Median : 7406   Median :2134  
##  Mean   : 7638   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.:2793  
##  Max.   :36019   Max.   :4900

# Steps vs. calories, with reference lines at the means reported by summary()
ggplot(data = combined_data) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories, color = TotalSteps)) +
  scale_color_gradientn(colours = rainbow(6)) +
  geom_hline(yintercept = 2304, color = "red", size = 1) +   # mean Calories
  geom_vline(xintercept = 7638, color = "blue", size = 1) +  # mean TotalSteps
  # annotate() draws the label once instead of once per data row
  annotate("text", x = 10000, y = 2100, label = "Mean", color = "black", size = 5) +
  theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold")) +
  labs(
    x = 'Steps taken',
    y = 'Calories burned',
    title = 'Calories burned for every step taken')

Calories burned for every step taken

From the scatter plot, we discovered that:

We have a positive correlation: calories burned tend to increase with the number of steps taken.
We observed that calories burned increase most steeply in the range of > 0 to 15,000 steps, with the calorie burn rate cooling down from 15,000 steps onwards.

Noted a few outliers:
* Zero steps with zero to minimal calories burned.
* 1 observation of > 35,000 steps with < 3,000 calories burned.

We deduced that outliers could be due to natural variation in the data, changes in users' usage or errors in data collection (i.e. miscalculations, data contamination or human error).
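
The outliers above were spotted by eye. As a rough cross-check, a common rule of thumb (Tukey's 1.5 × IQR fence, which is not the method used in this analysis) can be applied to the TotalSteps quartiles printed by `summary()`. Note that `summary()` rounds its output, so the fence is approximate.

```r
# Quartiles and maximum of TotalSteps as printed by summary() (rounded values)
q1_steps  <- 3790
q3_steps  <- 10727
max_steps <- 36019

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers
iqr_steps   <- q3_steps - q1_steps
upper_fence <- q3_steps + 1.5 * iqr_steps

cat("Upper fence for TotalSteps:", upper_fence, "\n")
cat("Maximum is above the fence:", max_steps > upper_fence, "\n")
```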

Plot Calories burned for every Hour logged:

combined_data %>%  
  select(TotalHours, SedentaryMinutes,
  Calories) %>%
  summary()

##    TotalHours    SedentaryMinutes    Calories   
##  Min.   : 0.03   Min.   :   0.0   Min.   :   0  
##  1st Qu.:16.50   1st Qu.: 729.8   1st Qu.:1828  
##  Median :24.00   Median :1057.5   Median :2134  
##  Mean   :20.31   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:24.00   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :24.00   Max.   :1440.0   Max.   :4900

# Hours logged vs. calories; reference lines use the means reported by summary()
ggplot(data = combined_data) +
  geom_point(mapping = aes(x = TotalHours, y = Calories, color = TotalSteps)) +
  scale_color_gradientn(colours = rainbow(3)) +
  theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold")) +
  # Legend labels now name the statistic each line actually shows
  geom_hline(aes(yintercept = 2304, linetype = "Average Calories"), colour = 'red', size = 1) +
  geom_vline(aes(xintercept = 991/60, linetype = "Average Sedentary"), colour = 'purple', size = 1) +
  geom_vline(aes(xintercept = 20.31, linetype = "Average Hours"), colour = 'blue', size = 1) +
  # Legend keys are sorted alphabetically: Calories (red), Hours (blue), Sedentary (purple)
  scale_linetype_manual(name = "Statistics", values = c(2, 2, 2),
                        guide = guide_legend(override.aes = list(color = c("red", "blue", "purple")))) +
  labs(
    x = 'Hours logged',
    y = 'Calories burned',
    title = 'Calories burned for every hour logged')

Calories burned for every hour logged

The scatter plot is showing:

A weak positive correlation, whereby an increase in hours logged does not translate into more calories being burned. That is largely due to the average sedentary hours (purple line), plotted in the 16 to 17 hours range.

Again, we can see a few outliers:
* The same zero value outliers.
* An unusual red dot at 24 hours with zero calories burned, which may be due to the same reasons as above.

Plot % of Activity in Minutes:

Data_Ind <- combined_data %>%
  summarise(Sum_VAM = sum(VeryActiveMinutes/1148807*100), 
         Sum_FAM = sum(FairlyActiveMinutes/1148807*100), 
         Sum_LAM = sum(LightlyActiveMinutes/1148807*100), 
         Sum_SEM = sum(SedentaryMinutes/1148807*100),
         Sum_TOTAL=sum(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)) %>% 
  round(digits = 2)

slices <- c(Data_Ind$Sum_VAM, Data_Ind$Sum_FAM, Data_Ind$Sum_LAM, Data_Ind$Sum_SEM)
lbls <- c("Very Active Min", "Fairly Active Min", "Lightly Active Min", "Sedentary Min")
pie(slices,
    labels=paste(lbls, slices, sep=" ", "%"),
    col = rainbow(6),
    main="Pie Chart - % of Activity in Minutes")

Percentage of Activity in Minutes

As seen from the pie chart:

Sedentary minutes takes the biggest slice at 81.10%.
This indicates that users are using the FitBit app to log daily activities such as daily commute, inactive movements (moving from one spot to another) or running errands.
The app is rarely used to track fitness (i.e. running), judging by the minor percentages of Fairly Active Activity (1.11%) and Very Active Activity (1.73%).
This is highly discouraging, as the FitBit app was developed to encourage fitness.
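
The percentages above rely on a total typed in by hand. A minimal sketch of deriving the shares directly from the column sums is shown below, using only the first three rows visible in the `str()` output; `sample_minutes` is a hypothetical stand-in for the four minutes columns of `combined_data`, so the printed shares will not match the full-data figures.

```r
# First three rows of the minutes columns as shown by str(combined_data)
sample_minutes <- data.frame(
  VeryActiveMinutes    = c(25, 21, 30),
  FairlyActiveMinutes  = c(13, 19, 11),
  LightlyActiveMinutes = c(328, 217, 181),
  SedentaryMinutes     = c(728, 776, 1218)
)

# Total minutes per activity level, then each level's share of the grand total
minutes_by_level <- colSums(sample_minutes)
share_pct <- round(minutes_by_level / sum(minutes_by_level) * 100, 2)
print(share_pct)
```

Because the denominator is computed from the same columns, the shares always add up to about 100% and stay correct if rows are added or filtered.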

Step 6: Act

In the final step, we will be delivering our insights and providing recommendations based on our analysis. Here, we revisit our business questions and share with you our high-level business recommendations.

1. What are some trends in smart device usage?

Sedentary time accounts for the majority (81.10%) of logged minutes, which suggests that users mostly use the FitBit app to track sedentary activities rather than their health habits. Users prefer to track their activities during weekdays as compared to weekends - perhaps because they spend more time outside on weekdays and stay in on weekends. The data also tell us that most users log their calories, steps taken, etc., while fewer log their sleep data.

2. How could these trends apply to Bellabeat customers?

Both companies develop products focused on providing women with their health, habit and fitness data and encouraging them to understand their current habits and make healthy decisions. These common trends surrounding health and fitness can very well be applied to Bellabeat customers. Bellabeat could easily market to these types of customers by telling them that smart devices could help them start their journey by measuring how much they're moving and how these moments of activity could help them live longer!

3. How could these trends help influence Bellabeat marketing strategy?

It is well documented that moderate-to-vigorous physical activity is protective against chronic disease. Conversely, emerging evidence indicates the deleterious effects of prolonged sitting. To change both behaviors, self-monitoring is one of the most robust behavior-change techniques available. The Bellabeat marketing team can encourage users by educating and equipping them with knowledge about fitness benefits, suggesting different types of exercise (i.e. a simple 10-minute exercise on weekdays and a more intense exercise on weekends) and providing calorie intake and burn rate information in the Bellabeat app. On weekends, the Bellabeat app can also send notifications to encourage users to exercise. By marketing these devices to consumers, Bellabeat provides a unique opportunity for individuals to change their behavior, become more physically active and increase their life expectancy.

Case Study 2: Bellabeat

Marlene Seleiro - Google Data Analytics Capstone - Portfolio-Ready Case Study

08/11/2021
