Introduction

This analysis is for case study #2 from the Google Data Analytics Certificate (Bellabeat).

It’s originally based on the case study “FitBit Fitness Tracker Data” by Möbius found here.

The purpose of this script is to clean and consolidate data to conduct an analysis that aims to get insights into how consumers are using their smart data. These insights will help guide the marketing strategy for the company.

Questions to be answered:

Packages used for the cleaning, analysis and visualization

  • tidyverse for data import and wrangling
  • lubridate for date functions
  • ggplot for visualization
  • janitor for data cleaning
  • hms for time variables handling

STEP 1: COLLECT DATA

Load csv files into different data frames

# Loop through csv files in directory and load to corresponding data frames
for (data in list.files(path = "Data",pattern="*.csv")) {
   name <- strsplit(data, "[_]")[[1]][1]
   assign(name, read_csv(paste("data",data, sep = "/")))
}
# Remove temporary variables created in the loop
rm(data)
rm(name)

Data overview

We will analyze the daily information corresponding to:

  • Daily Activity
  • Sleep
  • Weight
# Check sample of each data set
head(dailyActivity)
## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
head(sleepDay)
## # A tibble: 6 x 5
##          Id SleepDay           TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
##       <dbl> <chr>                         <dbl>             <dbl>          <dbl>
## 1    1.50e9 4/12/2016 12:00:0~                1               327            346
## 2    1.50e9 4/13/2016 12:00:0~                2               384            407
## 3    1.50e9 4/15/2016 12:00:0~                1               412            442
## 4    1.50e9 4/16/2016 12:00:0~                2               340            367
## 5    1.50e9 4/17/2016 12:00:0~                1               700            712
## 6    1.50e9 4/19/2016 12:00:0~                1               304            320
head(weightLogInfo)
## # A tibble: 6 x 8
##         Id Date       WeightKg WeightPounds   Fat   BMI IsManualReport     LogId
##      <dbl> <chr>         <dbl>        <dbl> <dbl> <dbl> <lgl>              <dbl>
## 1   1.50e9 5/2/2016 ~     52.6         116.    22  22.6 TRUE             1.46e12
## 2   1.50e9 5/3/2016 ~     52.6         116.    NA  22.6 TRUE             1.46e12
## 3   1.93e9 4/13/2016~    134.          294.    NA  47.5 FALSE            1.46e12
## 4   2.87e9 4/21/2016~     56.7         125.    NA  21.5 TRUE             1.46e12
## 5   2.87e9 5/12/2016~     57.3         126.    NA  21.7 TRUE             1.46e12
## 6   4.32e9 4/17/2016~     72.4         160.    25  27.5 TRUE             1.46e12
# Check number of users
length(unique(dailyActivity$Id)) #33
## [1] 33
length(unique(sleepDay$Id)) #24
## [1] 24
length(unique(weightLogInfo$Id)) #8
## [1] 8

Limitation: There is no sleep and weight information for all 33 users

STEP 2: WRANGLE DATA AND COMBINE INTO OTHER FILES

1. Daily Activity

Let’s start cleaning the Daily Activity dataset

# Data frame structure
str(dailyActivity)
## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
# Data frame sample
head(dailyActivity)
## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

Cleaning steps:

1.1 Activity Date is a string, convert to date

dailyActivity$ActivityDate <- as.Date(dailyActivity$ActivityDate, format = "%m/%d/%Y")

1.2. Check if total distance and tracker distance are the same to remove one of them

sum(dailyActivity$TotalDistance == dailyActivity$TrackerDistance) #925
## [1] 925
sum(dailyActivity$TotalDistance != dailyActivity$TrackerDistance) #15
## [1] 15
nrow(dailyActivity)
## [1] 940
# Check those rows
dailyActivity[dailyActivity$TotalDistance != dailyActivity$TrackerDistance, ]
## # A tibble: 15 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <date>            <dbl>         <dbl>           <dbl>            <dbl>
##  1 6.96e9 2016-04-21        11835          9.71            7.88             4.08
##  2 6.96e9 2016-04-25        13239          9.27            9.08             2.79
##  3 6.96e9 2016-05-09        12342          8.72            8.68             3.17
##  4 7.01e9 2016-04-12        14172         10.3             9.48             4.87
##  5 7.01e9 2016-04-13        12862          9.65            8.60             4.85
##  6 7.01e9 2016-04-14        11179          8.24            7.48             3.29
##  7 7.01e9 2016-04-18        14816         11.0             9.91             4.93
##  8 7.01e9 2016-04-19        14194         10.5             9.5              4.94
##  9 7.01e9 2016-04-20        15566         11.3            10.4              4.92
## 10 7.01e9 2016-04-25        18229         13.3            12.2              4.86
## 11 7.01e9 2016-04-27        13541         10.2             9.06             4.89
## 12 7.01e9 2016-04-29        20067         14.3            13.4              4.91
## 13 7.01e9 2016-05-02        13041          9.18            8.72             2.83
## 14 7.01e9 2016-05-03        14510         10.9             9.71             4.91
## 15 7.01e9 2016-05-05        15010         11.1            10.0              4.88
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

Since there are just a few rows (15/940) where the values of Total Distance and Tracker distance are different, and there is not enough information about the meaning of them, we will remove the Tracker Distance and work only with Total Distance.

dailyActivity$TrackerDistance <- NULL

1.3. Check if Logged Activities Distance is relevant

table(dailyActivity$LoggedActivitiesDistance)
## 
##                0 1.95959603786469  2.0921471118927 2.25308108329773 
##              908                1                9                7 
## 2.78517508506775 2.83232593536377 3.16782188415527 3.28541493415833 
##                1                1                1                1 
## 4.08169221878052  4.8513069152832 4.86179208755493  4.8697829246521 
##                1                1                1                1 
##  4.8782320022583 4.88560485839844 4.91114616394043 4.91236782073975 
##                1                1                1                1 
## 4.92484092712402 4.93055009841919 4.94214200973511 
##                1                1                1

908/940 rows have a value of 0 for Logged Activities Distance

It will not be considered for the analysis

dailyActivity$LoggedActivitiesDistance <- NULL

1.4. Identify and remove outliers in daily calories

summary(dailyActivity$Calories)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1828    2134    2304    2793    4900
boxplot(dailyActivity$Calories, horizontal = T)

# Store outliers in variable
cal_outliers <- boxplot(dailyActivity$Calories, horizontal = T)$out

# How many rows to remove?
nrow(dailyActivity[dailyActivity$Calories %in% cal_outliers, "Calories"]) # 16
## [1] 16
# Remove rows with outliers in calories
dailyActivity <- dailyActivity[!dailyActivity$Calories %in% cal_outliers, ]

rm(cal_outliers)

1.5. Look for missing values

sum(is.na(dailyActivity)) # No missing values
## [1] 0

2. Daily Sleep

Now clean the Sleep dataset

# Data frame structure
str(sleepDay)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
# Data frame sample

head(sleepDay)
## # A tibble: 6 x 5
##          Id SleepDay           TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
##       <dbl> <chr>                         <dbl>             <dbl>          <dbl>
## 1    1.50e9 4/12/2016 12:00:0~                1               327            346
## 2    1.50e9 4/13/2016 12:00:0~                2               384            407
## 3    1.50e9 4/15/2016 12:00:0~                1               412            442
## 4    1.50e9 4/16/2016 12:00:0~                2               340            367
## 5    1.50e9 4/17/2016 12:00:0~                1               700            712
## 6    1.50e9 4/19/2016 12:00:0~                1               304            320

Cleaning steps:

2.1. Sleep Day is a string, convert to date and change the name

sleepDay$SleepDay <- as.Date(sleepDay$SleepDay, format = "%m/%d/%Y")
sleepDay <- rename(sleepDay, Date = SleepDay)

2.2. Remove sleep records since it is not significant for the analysis

sleepDay$TotalSleepRecords <- NULL

2.3. Look for missing values

sum(is.na(sleepDay)) # No missing values
## [1] 0

3. Weight

Lastly, clean the Weight dataset

# Data frame structure
str(weightLogInfo)
## spec_tbl_df [67 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id            : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num [1:67] 116 116 294 125 126 ...
##  $ Fat           : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
##  $ LogId         : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   Date = col_character(),
##   ..   WeightKg = col_double(),
##   ..   WeightPounds = col_double(),
##   ..   Fat = col_double(),
##   ..   BMI = col_double(),
##   ..   IsManualReport = col_logical(),
##   ..   LogId = col_double()
##   .. )
# Data frame sample

head(weightLogInfo)
## # A tibble: 6 x 8
##         Id Date       WeightKg WeightPounds   Fat   BMI IsManualReport     LogId
##      <dbl> <chr>         <dbl>        <dbl> <dbl> <dbl> <lgl>              <dbl>
## 1   1.50e9 5/2/2016 ~     52.6         116.    22  22.6 TRUE             1.46e12
## 2   1.50e9 5/3/2016 ~     52.6         116.    NA  22.6 TRUE             1.46e12
## 3   1.93e9 4/13/2016~    134.          294.    NA  47.5 FALSE            1.46e12
## 4   2.87e9 4/21/2016~     56.7         125.    NA  21.5 TRUE             1.46e12
## 5   2.87e9 5/12/2016~     57.3         126.    NA  21.7 TRUE             1.46e12
## 6   4.32e9 4/17/2016~     72.4         160.    25  27.5 TRUE             1.46e12

Cleaning steps:

3.1. Date is a string, convert to date

weightLogInfo$Date <- as.Date(weightLogInfo$Date, format = "%m/%d/%Y")

3.2. Remove not significant columns for the analysis

weightLogInfo$IsManualReport <- NULL
weightLogInfo$LogId <- NULL

3.3. Look for missing values

sum(is.na(weightLogInfo)) # 65 NAs
## [1] 65
#Check the columns where NAs exist
sum(is.na(weightLogInfo$Id)) # 0
## [1] 0
sum(is.na(weightLogInfo$Date)) # 0
## [1] 0
sum(is.na(weightLogInfo$WeightKg)) # 0
## [1] 0
sum(is.na(weightLogInfo$WeightPounds)) # 0
## [1] 0
sum(is.na(weightLogInfo$Fat)) # 65
## [1] 65
sum(is.na(weightLogInfo$BMI)) # 0
## [1] 0
sum(is.na(weightLogInfo$Fat)) / nrow(weightLogInfo) # 97% of data is missing, remove variable
## [1] 0.9701493
weightLogInfo$Fat <- NULL

3. Combine the datasets

We will use the ID and the Date to join the three tables

First, we need to validate that the join will be clean

# Rename ActivityDate to Date
dailyActivity <- rename(dailyActivity, Date = ActivityDate)

nrow(dailyActivity) # 924
## [1] 924
dailyActivity %>%
  left_join(sleepDay, by = c("Id", "Date")) %>%
  left_join(weightLogInfo, by = c("Id", "Date")) %>%
  tally() # 927
## # A tibble: 1 x 1
##       n
##   <int>
## 1   927
# Check for duplicated registers

sum(duplicated(dailyActivity)) # 0
## [1] 0
sum(duplicated(sleepDay)) # 3
## [1] 3
sum(duplicated(weightLogInfo)) # 0
## [1] 0
# Remove the duplicated registers
sleepDay <- sleepDay[!duplicated(sleepDay), ]

Perform the join and save the result in a new variable

dailyData <- dailyActivity %>%
  left_join(sleepDay, by = c("Id", "Date")) %>%
  left_join(weightLogInfo, by = c("Id", "Date"))

4. Model the dataframe

4.1. Enrich the data

# Date enrichment
dailyData <- dailyData %>%
  mutate(Day = day(Date),
         Weekday = wday(Date, label = TRUE, abbr = FALSE, week_start = 1),
         Month = month(Date, label = TRUE),
         Year = year(Date))

# Validate Total Distance = SedentaryActiveDistance + LightActiveDistance + ModeratelyActiveDistance + VeryActiveDistance
dailyData %>%
  mutate(SumOfDistances = SedentaryActiveDistance + LightActiveDistance + ModeratelyActiveDistance + VeryActiveDistance) %>%
  filter(TotalDistance != SumOfDistances) %>%
  select(Id, Date, TotalDistance, SumOfDistances)
## # A tibble: 628 x 4
##            Id Date       TotalDistance SumOfDistances
##         <dbl> <date>             <dbl>          <dbl>
##  1 1503960366 2016-04-12          8.5            8.49
##  2 1503960366 2016-04-13          6.97           6.97
##  3 1503960366 2016-04-14          6.74           6.75
##  4 1503960366 2016-04-15          6.28           6.23
##  5 1503960366 2016-04-16          8.16           8.16
##  6 1503960366 2016-04-17          6.48           6.48
##  7 1503960366 2016-04-18          8.59           8.60
##  8 1503960366 2016-04-19          9.88           9.88
##  9 1503960366 2016-04-20          6.68           6.68
## 10 1503960366 2016-04-21          6.34           6.34
## # ... with 618 more rows
# The differences are not significant (<0.01)

# Add total minutes column
dailyData <- dailyData %>%
  mutate(TotalMinutes = SedentaryMinutes + LightlyActiveMinutes +
           FairlyActiveMinutes + VeryActiveMinutes)

4.2. To facilitate the analysis of the data, let’s create a long version of it

dailyDataLong <- dailyData %>%
  gather(ActivityTypeDistance, Distance,
         c("SedentaryActiveDistance","LightActiveDistance",
           "ModeratelyActiveDistance","VeryActiveDistance")) %>%
  gather(ActivityTypeMin, Minutes,
         c("SedentaryMinutes","LightlyActiveMinutes",
           "FairlyActiveMinutes","VeryActiveMinutes")) %>%
  arrange(Id, Date) %>%
  mutate(ActivityTypeDistance = as.factor(ActivityTypeDistance),
         ActivityTypeMin = as.factor(ActivityTypeMin))


# Redefine levels for distance
levels(dailyDataLong$ActivityTypeDistance) <- c("Lightly Active", "Moderately Active", "Sedentary Active", "Very Active")

dailyDataLong$ActivityTypeDistance <- factor(dailyDataLong$ActivityTypeDistance,
                                             levels = c("Sedentary Active", "Lightly Active",
                                                        "Moderately Active", "Very Active"))

# Redefine levels for minutes
levels(dailyDataLong$ActivityTypeMin) <- c("Moderately Active", "Lightly Active", "Sedentary Active", "Very Active")

dailyDataLong$ActivityTypeMin <- factor(dailyDataLong$ActivityTypeMin,
                                             levels = c("Sedentary Active", "Lightly Active",
                                                        "Moderately Active", "Very Active"))

4.3. Check Activity type for distance vs. for minutes

Are ActivityTypeDistance and ActivityTypeMin the same?

dailyDataLong %>%
  filter(ActivityTypeDistance != ActivityTypeMin) %>%
  select(Id, Date, ActivityTypeDistance, ActivityTypeMin)
## # A tibble: 11,088 x 4
##            Id Date       ActivityTypeDistance ActivityTypeMin  
##         <dbl> <date>     <fct>                <fct>            
##  1 1503960366 2016-04-12 Lightly Active       Sedentary Active 
##  2 1503960366 2016-04-12 Moderately Active    Sedentary Active 
##  3 1503960366 2016-04-12 Very Active          Sedentary Active 
##  4 1503960366 2016-04-12 Sedentary Active     Lightly Active   
##  5 1503960366 2016-04-12 Moderately Active    Lightly Active   
##  6 1503960366 2016-04-12 Very Active          Lightly Active   
##  7 1503960366 2016-04-12 Sedentary Active     Moderately Active
##  8 1503960366 2016-04-12 Lightly Active       Moderately Active
##  9 1503960366 2016-04-12 Very Active          Moderately Active
## 10 1503960366 2016-04-12 Sedentary Active     Very Active      
## # ... with 11,078 more rows

No, we can have a light activity regarding distance, but moderate for minutes

More information needed to completely understand how these values are defined

4.4. Calculate the percentage of missing values in each column

colMeans(is.na(dailyData))
##                       Id                     Date               TotalSteps 
##                0.0000000                0.0000000                0.0000000 
##            TotalDistance       VeryActiveDistance ModeratelyActiveDistance 
##                0.0000000                0.0000000                0.0000000 
##      LightActiveDistance  SedentaryActiveDistance        VeryActiveMinutes 
##                0.0000000                0.0000000                0.0000000 
##      FairlyActiveMinutes     LightlyActiveMinutes         SedentaryMinutes 
##                0.0000000                0.0000000                0.0000000 
##                 Calories       TotalMinutesAsleep           TotalTimeInBed 
##                0.0000000                0.5638528                0.5638528 
##                 WeightKg             WeightPounds                      BMI 
##                0.9307359                0.9307359                0.9307359 
##                      Day                  Weekday                    Month 
##                0.0000000                0.0000000                0.0000000 
##                     Year             TotalMinutes 
##                0.0000000                0.0000000

Not enough data for Weight and BMI

Almost half of the Sleep data is missing

STEP 3: CONDUCT DESCRIPTIVE ANALYSIS

1. Let’s check the behavior of Total Distance and Total minutes throughout the week

Total Distance

dailyData %>%
  group_by(Weekday) %>%
  summarise(TotalDistance = sum(TotalDistance)) %>%
  ggplot(aes(x=Weekday, y=TotalDistance,
             fill=factor(ifelse(Weekday %in% c("Tuesday", "Wednesday"),
                                "Highlighted","Normal")))) +
  geom_col() +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
  ylab("Total Distance") +
  labs(title = "Total distance per weekday",
       subtitle = "Tuesday and Wednesday are the days people travel the longest distances") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Total Minutes

dailyData %>%
  group_by(Weekday) %>%
  summarise(TotalMinutes = sum(TotalMinutes)) %>%
  ggplot(aes(x=Weekday, y=TotalMinutes,
             fill=factor(ifelse(Weekday %in% c("Tuesday", "Wednesday"),
                                "Highlighted","Normal")))) +
  geom_col() +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
  ylab("Total Minutes") +
  labs(title = "Total Minutes per weekday",
       subtitle = "Tuesday and Wednesday are the days people spend more time in activity") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

2. Look for correlation between the variables

ggplot(data=dailyData, aes(x=TotalMinutesAsleep, y=Calories)) + geom_point()

cor(dailyData[!is.na(dailyData$TotalMinutesAsleep),c("TotalMinutesAsleep", "Calories")]) #-0.0601
##                    TotalMinutesAsleep    Calories
## TotalMinutesAsleep         1.00000000 -0.06017452
## Calories                  -0.06017452  1.00000000
# What variable is more related to calories burn?
ggplot(data=dailyData, aes(x=TotalSteps, y=Calories)) + geom_point()

cor(dailyData[,c("TotalSteps", "Calories")]) #0.5645
##            TotalSteps  Calories
## TotalSteps  1.0000000 0.5645245
## Calories    0.5645245 1.0000000
ggplot(data=dailyData, aes(x=TotalDistance, y=Calories)) + geom_point()

cor(dailyData[,c("TotalDistance", "Calories")]) #0.6262
##               TotalDistance  Calories
## TotalDistance     1.0000000 0.6261985
## Calories          0.6261985 1.0000000
ggplot(data=dailyData, aes(x=TotalMinutes, y=Calories)) + geom_point()

cor(dailyData[,c("TotalMinutes", "Calories")]) #0.7614
##              TotalMinutes   Calories
## TotalMinutes   1.00000000 0.07613576
## Calories       0.07613576 1.00000000

The minutes of activity are more correlated to calories burned

Now let’s see which type of activity is more correlated to calories burned

# What type of activity is more related to calories burn?
# Distances
cor(dailyData[,c("SedentaryActiveDistance",
                 "LightActiveDistance",
                 "ModeratelyActiveDistance",
                 "VeryActiveDistance",
                 "Calories")])
##                          SedentaryActiveDistance LightActiveDistance
## SedentaryActiveDistance               1.00000000          0.09930055
## LightActiveDistance                   0.09930055          1.00000000
## ModeratelyActiveDistance              0.00558149          0.23445082
## VeryActiveDistance                    0.05280515          0.14116607
## Calories                              0.04767849          0.45853924
##                          ModeratelyActiveDistance VeryActiveDistance   Calories
## SedentaryActiveDistance                0.00558149         0.05280515 0.04767849
## LightActiveDistance                    0.23445082         0.14116607 0.45853924
## ModeratelyActiveDistance               1.00000000         0.19355232 0.20527455
## VeryActiveDistance                     0.19355232         1.00000000 0.47511959
## Calories                               0.20527455         0.47511959 1.00000000
#Minutes
cor(dailyData[,c("SedentaryMinutes",
                 "LightlyActiveMinutes",
                 "FairlyActiveMinutes",
                 "VeryActiveMinutes",
                 "Calories")])
##                      SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes
## SedentaryMinutes            1.0000000          -0.47408754          -0.2347354
## LightlyActiveMinutes       -0.4740875           1.00000000           0.1446884
## FairlyActiveMinutes        -0.2347354           0.14468837           1.0000000
## VeryActiveMinutes          -0.1402851           0.05150525           0.2784322
## Calories                   -0.1093503           0.25941715           0.2715836
##                      VeryActiveMinutes   Calories
## SedentaryMinutes           -0.14028509 -0.1093503
## LightlyActiveMinutes        0.05150525  0.2594172
## FairlyActiveMinutes         0.27843223  0.2715836
## VeryActiveMinutes           1.00000000  0.5989353
## Calories                    0.59893527  1.0000000

Very active minutes is the activity most closely correlated to calories burned

3. What type of activity is the most common among users?

dailyDataLong %>%
  group_by(ActivityTypeMin) %>%
  summarise(AvgHours = mean(Minutes)/60) %>%
  ggplot(aes(x=ActivityTypeMin, y=AvgHours,
             fill=factor(ifelse(ActivityTypeMin == "Lightly Active",
                                "Highlighted","Normal")))) +
  geom_col() +
  theme(axis.title.y = element_blank(),
        legend.position = "none") +
  labs(title = "Average hours spent on each activity",
       subtitle = "Among the active options, light activity is the most common") +
  scale_y_continuous(labels = scales::comma) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black")) +
  coord_flip() +
  ylab("Hours") +
  geom_text(aes(label=round(AvgHours,1)), hjust=-0.1, size=3.5)

4. What is the behaviour of the top performers compared to the average performers?

In order to segment users, we will check the distribution of total calories burned by user

# Summarize the total calories by user

UserCals <- dailyData %>%
  group_by(Id) %>%
  summarise(TotalCal = sum(Calories))

# Visualize the histogram for calories
hist(UserCals$TotalCal)

# Identify quantile numbers
quantile(UserCals$TotalCal)
##     0%    25%    50%    75%   100% 
##   7895  53449  62911  78524 106534

Users with calories above the Upper Quartile (75%) will be defined as Top Performers

Users with calories within the Interquartile Range will be defined as Average Performers

Users with calories below the Lower Quartile (25%) will be defined as Low Performers

Create a new data frame that contains the “type of user” variable

# Identify the users

UserCals <- UserCals %>%
  mutate(UserType = case_when(TotalCal > quantile(UserCals$TotalCal)[4] ~ "Top Performer",
                              TotalCal >= quantile(UserCals$TotalCal)[2] ~ "Avg Performer",
                              TotalCal < quantile(UserCals$TotalCal)[2] ~ "Low Performer"),
         UserType = factor(UserType, levels = c("Low Performer", "Avg Performer",
                                                   "Top Performer")))

# Get the user type variable from the previous dataset into Daily Datasets
dailyData <- dailyData %>%
  inner_join(UserCals, by="Id")

dailyDataLong <- dailyDataLong %>%
  inner_join(UserCals, by="Id")

Visualize the average time spent in activity (not sedentary) per weekday

dailyData %>%
  group_by(UserType, Weekday) %>%
  summarise(AvgHours = mean((TotalMinutes - SedentaryMinutes)/60)) %>%
  ggplot(aes(x=Weekday, y=AvgHours, fill=UserType)) +
  geom_col(position = "dodge") +
  theme(axis.title.x = element_blank()) +
  labs(title = "Average time in activity throughout the week",
       subtitle = "No major differences on weekdays, but during the weekend the top performers have a higher activity time") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black")) +
  ylab("Hours (avg)") +
  scale_fill_manual(values=c("#E3E1E2", "#C7C5C6", "#21ABCD")) +
  geom_segment(aes(x = 8, y = 4, xend = 7.5, yend = 3.5),
               arrow = arrow(length = unit(0.3, "cm"))) +
  geom_segment(aes(x = 7, y = 4.5, xend = 6.5, yend = 4),
               arrow = arrow(length = unit(0.3, "cm")))
## `summarise()` has grouped output by 'UserType'. You can override using the `.groups` argument.

There is not a significant different between the average time spend on the week

Let’s see what are the differences regarding the type of activity performed

dailyDataLong %>%
  mutate(PercActivity = Minutes / TotalMinutes) %>%
  group_by(UserType, ActivityTypeMin) %>%
  summarise(AvgPercAct = round(mean(PercActivity),2)) %>%
  ggplot(aes(x=ActivityTypeMin, y=AvgPercAct, fill=UserType)) +
  geom_col(position = "dodge") +
  theme(axis.title.x = element_blank()) +
  labs(title = "Average time in activity throughout the week",
       subtitle = "No major differences among the different type of users") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black")) +
  ylab("Time Percentage") +
  scale_fill_manual(values=c("#E3E1E2", "#C7C5C6", "#21ABCD")) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label=scales::percent(AvgPercAct), x=ActivityTypeMin, y=AvgPercAct+0.03),
            position = position_dodge(width = 0.9), size = 3)
## `summarise()` has grouped output by 'UserType'. You can override using the `.groups` argument.

As seen in the previous visualization, the main difference between top performers and the other users is the time spent on High-intensity activity, which seems to be the main component of calories consumption

See below the average differences in time spent on high-intensity activity

dailyDataLong %>%
  filter(ActivityTypeMin == "Very Active") %>%
  group_by(UserType, Weekday) %>%
  summarise(Hours = mean(Minutes/60)) %>%
  ggplot(aes(x=Weekday, y=Hours, fill=UserType)) +
  geom_col(position = "dodge") +
  theme(axis.title.x = element_blank()) +
  labs(title="Time spent (hours) on high-intensity activity ",
       subtitle="Top performers spend about twice as much time on high-intensity activities") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black")) +
  ylab("Time (h)") +
  scale_fill_manual(values=c("#E3E1E2", "#C7C5C6", "#21ABCD"))
## `summarise()` has grouped output by 'UserType'. You can override using the `.groups` argument.

5. What is the sleep behaviour among users?

dailyData %>%
  filter(!is.na(TotalMinutesAsleep)) %>%
  group_by(Weekday, UserType) %>%
  summarise(AvgSleepHrs = mean(TotalMinutesAsleep/60)) %>%
  ggplot(aes(x=Weekday, y=AvgSleepHrs, fill=UserType)) +
  geom_col(position = "dodge") +
  theme(axis.title.x = element_blank()) +
  labs(title="Sleep time (hours) by user type",
       subtitle="There is no significant difference between users.\nIn general, users sleep around 7 hours and only top performers sleep more than 8 hours on Sunday.") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black")) +
  ylab("Sleep Time (h)") +
  scale_fill_manual(values=c("#E3E1E2", "#C7C5C6", "#21ABCD"))
## `summarise()` has grouped output by 'Weekday'. You can override using the `.groups` argument.

STEP 4: CONCLUSIONS

What are some trends in smart device usage?

  • Tuesday and Wednesday are the days with more activity
  • High-intensity activity has the higher correlation to calories burned
  • Light-activity is the most common
  • Top performers (those with the highest calorie burn) are more active during the weekends than the rest of users
  • Top performers spend about twice as much time on very active exercise
  • Top performers spend around 40 minutes per day on very active exercise
  • Users sleep around 7 hours per day

How could these trends apply to Bellabeat customers?

  • The app can be used to encourage Bellabeat customers to perform very active exercise if they are looking to burn more calories in less time
  • Goals can be set regarding “very active” time and sleep time

How could these trends help influence Bellabeat marketing strategy?

  • By tracking users’ data, we can help them define the most effective ways to burn calories and stay healthy.

  • With the information gathered, we can set goals that are related to a healthier life: at least 7 hours of sleep and a minimum of 30min of very active exercise