Introduction
This analysis is for case study #2 from the Google Data Analytics Certificate (Bellabeat).
It’s originally based on the case study “FitBit Fitness Tracker Data” by Möbius found here.
The purpose of this script is to clean and consolidate data to conduct an analysis that aims to get insights into how consumers are using their smart data. These insights will help guide the marketing strategy for the company.
Questions to be answered:
- What are some trends in smart device usage?
- How could these trends apply to Bellabeat customers?
- How could these trends help influence Bellabeat marketing strategy?
Packages used for the cleaning, analysis and visualization
- tidyverse for data import and wrangling
- lubridate for date functions
- ggplot for visualization
- janitor for data cleaning
- hms for time variables handling
STEP 1: COLLECT DATA
Load csv files into different data frames
# Loop through csv files in directory and load to corresponding data frames
for (data in list.files(path = "Data",pattern="*.csv")) {
<- strsplit(data, "[_]")[[1]][1]
name assign(name, read_csv(paste("data",data, sep = "/")))
}# Remove temporary variables created in the loop
rm(data)
rm(name)
Data overview
We will analyze the daily information corresponding to:
- Daily Activity
- Sleep
- Weight
# Check sample of each data set
head(dailyActivity)
## # A tibble: 6 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
head(sleepDay)
## # A tibble: 6 x 5
## Id SleepDay TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 12:00:0~ 1 327 346
## 2 1.50e9 4/13/2016 12:00:0~ 2 384 407
## 3 1.50e9 4/15/2016 12:00:0~ 1 412 442
## 4 1.50e9 4/16/2016 12:00:0~ 2 340 367
## 5 1.50e9 4/17/2016 12:00:0~ 1 700 712
## 6 1.50e9 4/19/2016 12:00:0~ 1 304 320
head(weightLogInfo)
## # A tibble: 6 x 8
## Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1.50e9 5/2/2016 ~ 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1.50e9 5/3/2016 ~ 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1.93e9 4/13/2016~ 134. 294. NA 47.5 FALSE 1.46e12
## 4 2.87e9 4/21/2016~ 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2.87e9 5/12/2016~ 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4.32e9 4/17/2016~ 72.4 160. 25 27.5 TRUE 1.46e12
# Check number of users
length(unique(dailyActivity$Id)) #33
## [1] 33
length(unique(sleepDay$Id)) #24
## [1] 24
length(unique(weightLogInfo$Id)) #8
## [1] 8
Limitation: There is no sleep and weight information for all 33 users
STEP 2: WRANGLE DATA AND COMBINE INTO OTHER FILES
1. Daily Activity
Let’s start cleaning the Daily Activity dataset
# Data frame structure
str(dailyActivity)
## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
# Data frame sample
head(dailyActivity)
## # A tibble: 6 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
Cleaning steps:
1.1 Activity Date is a string, convert to date
$ActivityDate <- as.Date(dailyActivity$ActivityDate, format = "%m/%d/%Y") dailyActivity
1.2. Check if total distance and tracker distance are the same to remove one of them
sum(dailyActivity$TotalDistance == dailyActivity$TrackerDistance) #925
## [1] 925
sum(dailyActivity$TotalDistance != dailyActivity$TrackerDistance) #15
## [1] 15
nrow(dailyActivity)
## [1] 940
# Check those rows
$TotalDistance != dailyActivity$TrackerDistance, ] dailyActivity[dailyActivity
## # A tibble: 15 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 6.96e9 2016-04-21 11835 9.71 7.88 4.08
## 2 6.96e9 2016-04-25 13239 9.27 9.08 2.79
## 3 6.96e9 2016-05-09 12342 8.72 8.68 3.17
## 4 7.01e9 2016-04-12 14172 10.3 9.48 4.87
## 5 7.01e9 2016-04-13 12862 9.65 8.60 4.85
## 6 7.01e9 2016-04-14 11179 8.24 7.48 3.29
## 7 7.01e9 2016-04-18 14816 11.0 9.91 4.93
## 8 7.01e9 2016-04-19 14194 10.5 9.5 4.94
## 9 7.01e9 2016-04-20 15566 11.3 10.4 4.92
## 10 7.01e9 2016-04-25 18229 13.3 12.2 4.86
## 11 7.01e9 2016-04-27 13541 10.2 9.06 4.89
## 12 7.01e9 2016-04-29 20067 14.3 13.4 4.91
## 13 7.01e9 2016-05-02 13041 9.18 8.72 2.83
## 14 7.01e9 2016-05-03 14510 10.9 9.71 4.91
## 15 7.01e9 2016-05-05 15010 11.1 10.0 4.88
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
Since there are just a few rows (15/940) where the values of Total Distance and Tracker distance are different, and there is not enough information about the meaning of them, we will remove the Tracker Distance and work only with Total Distance.
$TrackerDistance <- NULL dailyActivity
1.3. Check if Logged Activities Distance is relevant
table(dailyActivity$LoggedActivitiesDistance)
##
## 0 1.95959603786469 2.0921471118927 2.25308108329773
## 908 1 9 7
## 2.78517508506775 2.83232593536377 3.16782188415527 3.28541493415833
## 1 1 1 1
## 4.08169221878052 4.8513069152832 4.86179208755493 4.8697829246521
## 1 1 1 1
## 4.8782320022583 4.88560485839844 4.91114616394043 4.91236782073975
## 1 1 1 1
## 4.92484092712402 4.93055009841919 4.94214200973511
## 1 1 1
908/940 rows have a value of 0 for Logged Activities Distance
It will not be considered for the analysis
$LoggedActivitiesDistance <- NULL dailyActivity
1.4. Identify and remove outliers in daily calories
summary(dailyActivity$Calories)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1828 2134 2304 2793 4900
boxplot(dailyActivity$Calories, horizontal = T)
# Store outliers in variable
<- boxplot(dailyActivity$Calories, horizontal = T)$out cal_outliers
# How many rows to remove?
nrow(dailyActivity[dailyActivity$Calories %in% cal_outliers, "Calories"]) # 16
## [1] 16
# Remove rows with outliers in calories
<- dailyActivity[!dailyActivity$Calories %in% cal_outliers, ]
dailyActivity
rm(cal_outliers)
1.5. Look for missing values
sum(is.na(dailyActivity)) # No missing values
## [1] 0
2. Daily Sleep
Now clean the Sleep dataset
# Data frame structure
str(sleepDay)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
# Data frame sample
head(sleepDay)
## # A tibble: 6 x 5
## Id SleepDay TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 12:00:0~ 1 327 346
## 2 1.50e9 4/13/2016 12:00:0~ 2 384 407
## 3 1.50e9 4/15/2016 12:00:0~ 1 412 442
## 4 1.50e9 4/16/2016 12:00:0~ 2 340 367
## 5 1.50e9 4/17/2016 12:00:0~ 1 700 712
## 6 1.50e9 4/19/2016 12:00:0~ 1 304 320
Cleaning steps:
2.1. Sleep Day is a string, convert to date and change the name
$SleepDay <- as.Date(sleepDay$SleepDay, format = "%m/%d/%Y")
sleepDay<- rename(sleepDay, Date = SleepDay) sleepDay
2.2. Remove sleep records since it is not significant for the analysis
$TotalSleepRecords <- NULL sleepDay
2.3. Look for missing values
sum(is.na(sleepDay)) # No missing values
## [1] 0
3. Weight
Lastly, clean the Weight dataset
# Data frame structure
str(weightLogInfo)
## spec_tbl_df [67 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num [1:67] 116 116 294 125 126 ...
## $ Fat : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
## $ LogId : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. Date = col_character(),
## .. WeightKg = col_double(),
## .. WeightPounds = col_double(),
## .. Fat = col_double(),
## .. BMI = col_double(),
## .. IsManualReport = col_logical(),
## .. LogId = col_double()
## .. )
# Data frame sample
head(weightLogInfo)
## # A tibble: 6 x 8
## Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1.50e9 5/2/2016 ~ 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1.50e9 5/3/2016 ~ 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1.93e9 4/13/2016~ 134. 294. NA 47.5 FALSE 1.46e12
## 4 2.87e9 4/21/2016~ 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2.87e9 5/12/2016~ 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4.32e9 4/17/2016~ 72.4 160. 25 27.5 TRUE 1.46e12
Cleaning steps:
3.1. Date is a string, convert to date
$Date <- as.Date(weightLogInfo$Date, format = "%m/%d/%Y") weightLogInfo
3.2. Remove not significant columns for the analysis
$IsManualReport <- NULL
weightLogInfo$LogId <- NULL weightLogInfo
3.3. Look for missing values
sum(is.na(weightLogInfo)) # 65 NAs
## [1] 65
#Check the columns where NAs exist
sum(is.na(weightLogInfo$Id)) # 0
## [1] 0
sum(is.na(weightLogInfo$Date)) # 0
## [1] 0
sum(is.na(weightLogInfo$WeightKg)) # 0
## [1] 0
sum(is.na(weightLogInfo$WeightPounds)) # 0
## [1] 0
sum(is.na(weightLogInfo$Fat)) # 65
## [1] 65
sum(is.na(weightLogInfo$BMI)) # 0
## [1] 0
sum(is.na(weightLogInfo$Fat)) / nrow(weightLogInfo) # 97% of data is missing, remove variable
## [1] 0.9701493
$Fat <- NULL weightLogInfo
3. Combine the datasets
We will use the ID and the Date to join the three tables
First, we need to validate that the join will be clean
# Rename ActivityDate to Date
<- rename(dailyActivity, Date = ActivityDate)
dailyActivity
nrow(dailyActivity) # 924
## [1] 924
%>%
dailyActivity left_join(sleepDay, by = c("Id", "Date")) %>%
left_join(weightLogInfo, by = c("Id", "Date")) %>%
tally() # 927
## # A tibble: 1 x 1
## n
## <int>
## 1 927
# Check for duplicated registers
sum(duplicated(dailyActivity)) # 0
## [1] 0
sum(duplicated(sleepDay)) # 3
## [1] 3
sum(duplicated(weightLogInfo)) # 0
## [1] 0
# Remove the duplicated registers
<- sleepDay[!duplicated(sleepDay), ] sleepDay
Perform the join and save the result in a new variable
<- dailyActivity %>%
dailyData left_join(sleepDay, by = c("Id", "Date")) %>%
left_join(weightLogInfo, by = c("Id", "Date"))
4. Model the dataframe
4.1. Enrich the data
# Date enrichment
<- dailyData %>%
dailyData mutate(Day = day(Date),
Weekday = wday(Date, label = TRUE, abbr = FALSE, week_start = 1),
Month = month(Date, label = TRUE),
Year = year(Date))
# Validate Total Distance = SedentaryActiveDistance + LightActiveDistance + ModeratelyActiveDistance + VeryActiveDistance
%>%
dailyData mutate(SumOfDistances = SedentaryActiveDistance + LightActiveDistance + ModeratelyActiveDistance + VeryActiveDistance) %>%
filter(TotalDistance != SumOfDistances) %>%
select(Id, Date, TotalDistance, SumOfDistances)
## # A tibble: 628 x 4
## Id Date TotalDistance SumOfDistances
## <dbl> <date> <dbl> <dbl>
## 1 1503960366 2016-04-12 8.5 8.49
## 2 1503960366 2016-04-13 6.97 6.97
## 3 1503960366 2016-04-14 6.74 6.75
## 4 1503960366 2016-04-15 6.28 6.23
## 5 1503960366 2016-04-16 8.16 8.16
## 6 1503960366 2016-04-17 6.48 6.48
## 7 1503960366 2016-04-18 8.59 8.60
## 8 1503960366 2016-04-19 9.88 9.88
## 9 1503960366 2016-04-20 6.68 6.68
## 10 1503960366 2016-04-21 6.34 6.34
## # ... with 618 more rows
# The differences are not significant (<0.01)
# Add total minutes column
<- dailyData %>%
dailyData mutate(TotalMinutes = SedentaryMinutes + LightlyActiveMinutes +
+ VeryActiveMinutes) FairlyActiveMinutes
4.2. To facilitate the analysis of the data, let’s create a long version of it
<- dailyData %>%
dailyDataLong gather(ActivityTypeDistance, Distance,
c("SedentaryActiveDistance","LightActiveDistance",
"ModeratelyActiveDistance","VeryActiveDistance")) %>%
gather(ActivityTypeMin, Minutes,
c("SedentaryMinutes","LightlyActiveMinutes",
"FairlyActiveMinutes","VeryActiveMinutes")) %>%
arrange(Id, Date) %>%
mutate(ActivityTypeDistance = as.factor(ActivityTypeDistance),
ActivityTypeMin = as.factor(ActivityTypeMin))
# Redefine levels for distance
levels(dailyDataLong$ActivityTypeDistance) <- c("Lightly Active", "Moderately Active", "Sedentary Active", "Very Active")
$ActivityTypeDistance <- factor(dailyDataLong$ActivityTypeDistance,
dailyDataLonglevels = c("Sedentary Active", "Lightly Active",
"Moderately Active", "Very Active"))
# Redefine levels for minutes
levels(dailyDataLong$ActivityTypeMin) <- c("Moderately Active", "Lightly Active", "Sedentary Active", "Very Active")
$ActivityTypeMin <- factor(dailyDataLong$ActivityTypeMin,
dailyDataLonglevels = c("Sedentary Active", "Lightly Active",
"Moderately Active", "Very Active"))
4.3. Check Activity type for distance vs. for minutes
Are ActivityTypeDistance and ActivityTypeMin the same?
%>%
dailyDataLong filter(ActivityTypeDistance != ActivityTypeMin) %>%
select(Id, Date, ActivityTypeDistance, ActivityTypeMin)
## # A tibble: 11,088 x 4
## Id Date ActivityTypeDistance ActivityTypeMin
## <dbl> <date> <fct> <fct>
## 1 1503960366 2016-04-12 Lightly Active Sedentary Active
## 2 1503960366 2016-04-12 Moderately Active Sedentary Active
## 3 1503960366 2016-04-12 Very Active Sedentary Active
## 4 1503960366 2016-04-12 Sedentary Active Lightly Active
## 5 1503960366 2016-04-12 Moderately Active Lightly Active
## 6 1503960366 2016-04-12 Very Active Lightly Active
## 7 1503960366 2016-04-12 Sedentary Active Moderately Active
## 8 1503960366 2016-04-12 Lightly Active Moderately Active
## 9 1503960366 2016-04-12 Very Active Moderately Active
## 10 1503960366 2016-04-12 Sedentary Active Very Active
## # ... with 11,078 more rows
No, we can have a light activity regarding distance, but moderate for minutes
More information needed to completely understand how these values are defined
4.4. Calculate the percentage of missing values in each column
colMeans(is.na(dailyData))
## Id Date TotalSteps
## 0.0000000 0.0000000 0.0000000
## TotalDistance VeryActiveDistance ModeratelyActiveDistance
## 0.0000000 0.0000000 0.0000000
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 0.0000000 0.0000000 0.0000000
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## 0.0000000 0.0000000 0.0000000
## Calories TotalMinutesAsleep TotalTimeInBed
## 0.0000000 0.5638528 0.5638528
## WeightKg WeightPounds BMI
## 0.9307359 0.9307359 0.9307359
## Day Weekday Month
## 0.0000000 0.0000000 0.0000000
## Year TotalMinutes
## 0.0000000 0.0000000
Not enough data for Weight and BMI
Almost half of the Sleep data is missing
STEP 3: CONDUCT DESCRIPTIVE ANALYSIS
1. Let’s check the behavior of Total Distance and Total minutes throughout the week
Total Distance
%>%
dailyData group_by(Weekday) %>%
summarise(TotalDistance = sum(TotalDistance)) %>%
ggplot(aes(x=Weekday, y=TotalDistance,
fill=factor(ifelse(Weekday %in% c("Tuesday", "Wednesday"),
"Highlighted","Normal")))) +
geom_col() +
theme(axis.title.x = element_blank(),
legend.position = "none") +
ylab("Total Distance") +
labs(title = "Total distance per weekday",
subtitle = "Tuesday and Wednesday are the days people travel the longest distances") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
Total Minutes
%>%
dailyData group_by(Weekday) %>%
summarise(TotalMinutes = sum(TotalMinutes)) %>%
ggplot(aes(x=Weekday, y=TotalMinutes,
fill=factor(ifelse(Weekday %in% c("Tuesday", "Wednesday"),
"Highlighted","Normal")))) +
geom_col() +
theme(axis.title.x = element_blank(),
legend.position = "none") +
ylab("Total Minutes") +
labs(title = "Total Minutes per weekday",
subtitle = "Tuesday and Wednesday are the days people spend more time in activity") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
2. Look for correlation between the variables
ggplot(data=dailyData, aes(x=TotalMinutesAsleep, y=Calories)) + geom_point()
cor(dailyData[!is.na(dailyData$TotalMinutesAsleep),c("TotalMinutesAsleep", "Calories")]) #-0.0601
## TotalMinutesAsleep Calories
## TotalMinutesAsleep 1.00000000 -0.06017452
## Calories -0.06017452 1.00000000
# What variable is more related to calories burn?
ggplot(data=dailyData, aes(x=TotalSteps, y=Calories)) + geom_point()
cor(dailyData[,c("TotalSteps", "Calories")]) #0.5645
## TotalSteps Calories
## TotalSteps 1.0000000 0.5645245
## Calories 0.5645245 1.0000000
ggplot(data=dailyData, aes(x=TotalDistance, y=Calories)) + geom_point()
cor(dailyData[,c("TotalDistance", "Calories")]) #0.6262
## TotalDistance Calories
## TotalDistance 1.0000000 0.6261985
## Calories 0.6261985 1.0000000
ggplot(data=dailyData, aes(x=TotalMinutes, y=Calories)) + geom_point()
cor(dailyData[,c("TotalMinutes", "Calories")]) #0.7614
## TotalMinutes Calories
## TotalMinutes 1.00000000 0.07613576
## Calories 0.07613576 1.00000000
The minutes of activity are more correlated to calories burned
Now let’s see which type of activity is more correlated to calories burned
# What type of activity is more related to calories burn?
# Distances
cor(dailyData[,c("SedentaryActiveDistance",
"LightActiveDistance",
"ModeratelyActiveDistance",
"VeryActiveDistance",
"Calories")])
## SedentaryActiveDistance LightActiveDistance
## SedentaryActiveDistance 1.00000000 0.09930055
## LightActiveDistance 0.09930055 1.00000000
## ModeratelyActiveDistance 0.00558149 0.23445082
## VeryActiveDistance 0.05280515 0.14116607
## Calories 0.04767849 0.45853924
## ModeratelyActiveDistance VeryActiveDistance Calories
## SedentaryActiveDistance 0.00558149 0.05280515 0.04767849
## LightActiveDistance 0.23445082 0.14116607 0.45853924
## ModeratelyActiveDistance 1.00000000 0.19355232 0.20527455
## VeryActiveDistance 0.19355232 1.00000000 0.47511959
## Calories 0.20527455 0.47511959 1.00000000
#Minutes
cor(dailyData[,c("SedentaryMinutes",
"LightlyActiveMinutes",
"FairlyActiveMinutes",
"VeryActiveMinutes",
"Calories")])
## SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes
## SedentaryMinutes 1.0000000 -0.47408754 -0.2347354
## LightlyActiveMinutes -0.4740875 1.00000000 0.1446884
## FairlyActiveMinutes -0.2347354 0.14468837 1.0000000
## VeryActiveMinutes -0.1402851 0.05150525 0.2784322
## Calories -0.1093503 0.25941715 0.2715836
## VeryActiveMinutes Calories
## SedentaryMinutes -0.14028509 -0.1093503
## LightlyActiveMinutes 0.05150525 0.2594172
## FairlyActiveMinutes 0.27843223 0.2715836
## VeryActiveMinutes 1.00000000 0.5989353
## Calories 0.59893527 1.0000000
Very active minutes is the activity most closely correlated to calories burned
3. What type of activity is the most common among users?
%>%
dailyDataLong group_by(ActivityTypeMin) %>%
summarise(AvgHours = mean(Minutes)/60) %>%
ggplot(aes(x=ActivityTypeMin, y=AvgHours,
fill=factor(ifelse(ActivityTypeMin == "Lightly Active",
"Highlighted","Normal")))) +
geom_col() +
theme(axis.title.y = element_blank(),
legend.position = "none") +
labs(title = "Average hours spent on each activity",
subtitle = "Among the active options, light activity is the most common") +
scale_y_continuous(labels = scales::comma) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
coord_flip() +
ylab("Hours") +
geom_text(aes(label=round(AvgHours,1)), hjust=-0.1, size=3.5)
4. What is the behaviour of the top performers compared to the average performers?
In order to segment users, we will check the distribution of total calories burned by user
# Summarize the total calories by user
<- dailyData %>%
UserCals group_by(Id) %>%
summarise(TotalCal = sum(Calories))
# Visualize the histogram for calories
hist(UserCals$TotalCal)
# Identify quantile numbers
quantile(UserCals$TotalCal)
## 0% 25% 50% 75% 100%
## 7895 53449 62911 78524 106534
Users with calories above the Upper Quartile (75%) will be defined as Top Performers
Users with calories within the Interquartile Range will be defined as Average Performers
Users with calories below the Lower Quartile (25%) will be defined as Low Performers
Create a new data frame that contains the “type of user” variable
# Identify the users
<- UserCals %>%
UserCals mutate(UserType = case_when(TotalCal > quantile(UserCals$TotalCal)[4] ~ "Top Performer",
>= quantile(UserCals$TotalCal)[2] ~ "Avg Performer",
TotalCal < quantile(UserCals$TotalCal)[2] ~ "Low Performer"),
TotalCal UserType = factor(UserType, levels = c("Low Performer", "Avg Performer",
"Top Performer")))
# Get the user type variable from the previous dataset into Daily Datasets
<- dailyData %>%
dailyData inner_join(UserCals, by="Id")
<- dailyDataLong %>%
dailyDataLong inner_join(UserCals, by="Id")
Visualize the average time spent in activity (not sedentary) per weekday
%>%
dailyData group_by(UserType, Weekday) %>%
summarise(AvgHours = mean((TotalMinutes - SedentaryMinutes)/60)) %>%
ggplot(aes(x=Weekday, y=AvgHours, fill=UserType)) +
geom_col(position = "dodge") +
theme(axis.title.x = element_blank()) +
labs(title = "Average time in activity throughout the week",
subtitle = "No major differences on weekdays, but during the weekend the top performers have a higher activity time") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
ylab("Hours (avg)") +
scale_fill_manual(values=c("#E3E1E2", "#C7C5C6", "#21ABCD")) +
geom_segment(aes(x = 8, y = 4, xend = 7.5, yend = 3.5),
arrow = arrow(length = unit(0.3, "cm"))) +
geom_segment(aes(x = 7, y = 4.5, xend = 6.5, yend = 4),
arrow = arrow(length = unit(0.3, "cm")))
## `summarise()` has grouped output by 'UserType'. You can override using the `.groups` argument.
There is not a significant different between the average time spend on the week
Let’s see what are the differences regarding the type of activity performed
%>%
dailyDataLong mutate(PercActivity = Minutes / TotalMinutes) %>%
group_by(UserType, ActivityTypeMin) %>%
summarise(AvgPercAct = round(mean(PercActivity),2)) %>%
ggplot(aes(x=ActivityTypeMin, y=AvgPercAct, fill=UserType)) +
geom_col(position = "dodge") +
theme(axis.title.x = element_blank()) +
labs(title = "Average time in activity throughout the week",
subtitle = "No major differences among the different type of users") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
ylab("Time Percentage") +
scale_fill_manual(values=c("#E3E1E2", "#C7C5C6", "#21ABCD")) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label=scales::percent(AvgPercAct), x=ActivityTypeMin, y=AvgPercAct+0.03),
position = position_dodge(width = 0.9), size = 3)
## `summarise()` has grouped output by 'UserType'. You can override using the `.groups` argument.
As seen in the previous visualization, the main difference between top performers and the other users is the time spent on High-intensity activity, which seems to be the main component of calories consumption
See below the average differences in time spent on high-intensity activity
%>%
dailyDataLong filter(ActivityTypeMin == "Very Active") %>%
group_by(UserType, Weekday) %>%
summarise(Hours = mean(Minutes/60)) %>%
ggplot(aes(x=Weekday, y=Hours, fill=UserType)) +
geom_col(position = "dodge") +
theme(axis.title.x = element_blank()) +
labs(title="Time spent (hours) on high-intensity activity ",
subtitle="Top performers spend about twice as much time on high-intensity activities") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
ylab("Time (h)") +
scale_fill_manual(values=c("#E3E1E2", "#C7C5C6", "#21ABCD"))
## `summarise()` has grouped output by 'UserType'. You can override using the `.groups` argument.
5. What is the sleep behaviour among users?
%>%
dailyData filter(!is.na(TotalMinutesAsleep)) %>%
group_by(Weekday, UserType) %>%
summarise(AvgSleepHrs = mean(TotalMinutesAsleep/60)) %>%
ggplot(aes(x=Weekday, y=AvgSleepHrs, fill=UserType)) +
geom_col(position = "dodge") +
theme(axis.title.x = element_blank()) +
labs(title="Sleep time (hours) by user type",
subtitle="There is no significant difference between users.\nIn general, users sleep around 7 hours and only top performers sleep more than 8 hours on Sunday.") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
ylab("Sleep Time (h)") +
scale_fill_manual(values=c("#E3E1E2", "#C7C5C6", "#21ABCD"))
## `summarise()` has grouped output by 'Weekday'. You can override using the `.groups` argument.
STEP 4: CONCLUSIONS
What are some trends in smart device usage?
- Tuesday and Wednesday are the days with more activity
- High-intensity activity has the higher correlation to calories burned
- Light-activity is the most common
- Top performers (those with the highest calorie burn) are more active during the weekends than the rest of users
- Top performers spend about twice as much time on very active exercise
- Top performers spend around 40 minutes per day on very active exercise
- Users sleep around 7 hours per day
How could these trends apply to Bellabeat customers?
- The app can be used to encourage Bellabeat customers to perform very active exercise if they are looking to burn more calories in less time
- Goals can be set regarding “very active” time and sleep time
How could these trends help influence Bellabeat marketing strategy?
By tracking users’ data, we can help them define the most effective ways to burn calories and stay healthy.
With the information gathered, we can set goals that are related to a healthier life: at least 7 hours of sleep and a minimum of 30min of very active exercise