## Introduction and background

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
Key stakeholders:

- Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
- Sando Mur: mathematician, Bellabeat’s cofounder, and key member of the Bellabeat executive team
- Bellabeat marketing analytics team: a team of data analysts guiding Bellabeat’s marketing strategy
Using the Case Study Roadmap as a guide, this analysis will follow the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.
The business task is to analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices, then select one Bellabeat product and apply these insights to it in my presentation.
The cofounder and Chief Creative Officer encourages me to use public data that explores smart device users’ daily habits and points me to a specific Kaggle data set. I prepared the data for analysis using the “Case Study Roadmap” as a guide and assessed its quality as follows:
- Reliable: LOW. Not reliable, as it only has 30 respondents.
- Original: LOW. Third-party provider (Amazon Mechanical Turk).
- Comprehensive: MED. Parameters match most of the Bellabeat products’ parameters.
- Current: LOW. Data is 5 years old and may not be relevant.
- Cited: LOW. Data was collected by a third party, so the source is unknown.

Overall, this dataset is considered “bad quality data”, and producing business recommendations based on it is not recommended.
Process the data by cleaning it and ensuring that it is correct, relevant, complete, and free of errors and outliers.
I upload the CSV files to my project from the relevant data source:

https://www.kaggle.com/arashnic/fitbit

There are many different CSV files in this dataset, but I decided to concentrate on two of them: “dailyActivity_merged.csv” and “sleepDay_merged.csv”.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(dplyr)
library(ggplot2)
library(knitr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(rmarkdown)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr)
# Import dataset "dailyActivity_merged.csv"
daily_activity <- read_csv("C:\\Users\\MM\\OneDrive\\Documentos\\Fitabase Data 4.12.16-5.12.16\\dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import dataset "sleepDay_merged.csv"
daily_sleep <- read_csv("C:\\Users\\MM\\OneDrive\\Documentos\\Fitabase Data 4.12.16-5.12.16\\sleepDay_merged.csv")
## Rows: 413 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Explore and preview the first 10 rows of data
head(daily_activity, 10)
## # A tibble: 10 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5 0
## 2 1503960366 4/13/2016 10735 6.97 6.97 0
## 3 1503960366 4/14/2016 10460 6.74 6.74 0
## 4 1503960366 4/15/2016 9762 6.28 6.28 0
## 5 1503960366 4/16/2016 12669 8.16 8.16 0
## 6 1503960366 4/17/2016 9705 6.48 6.48 0
## 7 1503960366 4/18/2016 13019 8.59 8.59 0
## 8 1503960366 4/19/2016 15506 9.88 9.88 0
## 9 1503960366 4/20/2016 10544 6.68 6.68 0
## 10 1503960366 4/21/2016 9819 6.34 6.34 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
head(daily_sleep, 10)
## # A tibble: 10 x 5
## Id SleepDay TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
# Familiarize with the data and column datatypes
str(daily_activity)
## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(daily_sleep)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
How many unique participants are there in each dataframe? It looks like there may be more participants in the daily activity dataset than the sleep dataset.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_sleep$Id)
## [1] 24
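To see which participants account for that gap, one option is to compare the two sets of IDs directly. The sketch below uses small made-up ID vectors (not the real data) to show the pattern with base R's `setdiff()`; with the real data the inputs would presumably be `daily_activity$Id` and `daily_sleep$Id`.

```r
# Hypothetical IDs for illustration only; not taken from the Fitbit data
activity_ids <- c(101, 102, 103, 104, 105)
sleep_ids    <- c(101, 103, 105)

# IDs that logged activity but never logged sleep
no_sleep_ids <- setdiff(activity_ids, sleep_ids)
no_sleep_ids

# Number of participants without any sleep record
length(no_sleep_ids)
```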
# Check for missing values
sum(is.na(daily_activity))
## [1] 0
sum(is.na(daily_sleep))
## [1] 0
#Check for duplicates
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
# Remove duplicates and NA from table 2
daily_sleep <- daily_sleep %>%
  distinct() %>%
  drop_na()
# Check duplicates were removed from tables
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 0
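# Preview snake_case column names for dataset 1
# (the result is not assigned, so daily_activity keeps its original column names)
clean_names(daily_activity)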
## # A tibble: 940 x 15
## id activity_date total_steps total_distance tracker_distance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## 7 1503960366 4/18/2016 13019 8.59 8.59
## 8 1503960366 4/19/2016 15506 9.88 9.88
## 9 1503960366 4/20/2016 10544 6.68 6.68
## 10 1503960366 4/21/2016 9819 6.34 6.34
## # ... with 930 more rows, and 10 more variables:
## # logged_activities_distance <dbl>, very_active_distance <dbl>,
## # moderately_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>
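# Preview snake_case column names for dataset 2
# (the result is not assigned, so daily_sleep keeps its original column names)
clean_names(daily_sleep)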
## # A tibble: 410 x 5
## id sleep_day total_sleep_rec~ total_minutes_a~ total_time_in_b~
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
## # ... with 400 more rows
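Note that `clean_names()` returns a new data frame rather than modifying its input, so the calls above only print a preview. A minimal sketch of the difference, using a made-up two-column data frame (the name `toy` and its values are assumptions for illustration):

```r
library(janitor)

# Toy data frame with CamelCase names like the Fitbit files (values are made up)
toy <- data.frame(Id = 1:2, TotalSteps = c(1000, 2000))

# Calling clean_names() without assignment leaves `toy` unchanged
invisible(clean_names(toy))
names(toy)

# Assigning the result stores the snake_case names
toy_clean <- clean_names(toy)
names(toy_clean)
```

The rest of this analysis refers to the original column names (for example `ActivityDate`), so storing the cleaned names would also mean updating those later references.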
# Convert the date columns from character to Date (yyyy-mm-dd) and rename them "Date"
daily_activity <- daily_activity %>%
  rename(Date = ActivityDate) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y"))
# as_date() ignores a `tz` argument, so it is omitted here
daily_sleep <- daily_sleep %>%
  rename(Date = SleepDay) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y %I:%M:%S %p"))
# Confirm column date is updated correctly
head(daily_activity)
## # A tibble: 6 x 15
## Id Date TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5 0
## 2 1503960366 2016-04-13 10735 6.97 6.97 0
## 3 1503960366 2016-04-14 10460 6.74 6.74 0
## 4 1503960366 2016-04-15 9762 6.28 6.28 0
## 5 1503960366 2016-04-16 12669 8.16 8.16 0
## 6 1503960366 2016-04-17 9705 6.48 6.48 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
head(daily_sleep)
## # A tibble: 6 x 5
## Id Date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
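As an illustration of why both columns end up comparable, the sketch below parses sample strings in the two formats seen in the CSV files. It uses base R's `as.Date()` rather than the lubridate call above, on the assumption that every sleep timestamp is at midnight so the time part can be discarded.

```r
# Sample strings in the two formats seen in the CSV files
activity_dates <- c("4/12/2016", "4/13/2016")
sleep_dates    <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Month/day/year strings become Date objects
parsed_activity <- as.Date(activity_dates, format = "%m/%d/%Y")

# Characters after the matched date part are ignored, which drops the time
parsed_sleep <- as.Date(sleep_dates, format = "%m/%d/%Y")

# Both vectors now hold the same dates, so they can serve as a join key
all(parsed_activity == parsed_sleep)
```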
Now that the data is stored appropriately and has been prepared for analysis, we can start putting it to work.
How many observations are there in each dataframe?
nrow(daily_activity)
## [1] 940
nrow(daily_sleep)
## [1] 410
daily_activity %>%
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes) %>%
  summary()
## TotalSteps TotalDistance SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :1440.0
daily_sleep %>%
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
What’s the relationship between steps taken in a day and sedentary minutes? How could this help inform the customer segments that we can market to?
ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes, color = Calories)) + geom_point()
ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point(aes(color=Date))
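The second scatter plot compares minutes asleep with time in bed. One way to put a number on that relationship is the share of time in bed actually spent asleep. The sketch below applies this to the first six sleep records shown in the preview earlier; the column names `MinutesAwakeInBed` and `SleepEfficiency` are introduced here for illustration and are not part of the dataset.

```r
# First six sleep records from the head(daily_sleep) preview
sleep_sample <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Minutes spent in bed but not asleep
sleep_sample$MinutesAwakeInBed <-
  sleep_sample$TotalTimeInBed - sleep_sample$TotalMinutesAsleep

# Share of time in bed spent asleep (an illustrative ratio)
sleep_sample$SleepEfficiency <-
  round(sleep_sample$TotalMinutesAsleep / sleep_sample$TotalTimeInBed, 3)

sleep_sample
```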
We could market the watch to consumers as a way to monitor their time in bed against their actual sleep time.

## Areas to explore further

Which days of the week do users spend the most time logging, and how does this relate to sedentary minutes?
# Inner join: keep only Id/Date pairs that appear in both tables
daily_data <- merge(daily_activity, daily_sleep, by = c("Id", "Date"))
glimpse(daily_data)
## Rows: 410
## Columns: 18
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-~
## $ TotalSteps <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544~
## $ TotalDistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3~
## $ FairlyActiveMinutes <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,~
## $ LightlyActiveMinutes <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, ~
## $ SedentaryMinutes <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, ~
## $ Calories <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177~
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, ~
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, ~
n_distinct(daily_data$Id)
## [1] 24
# Full outer join: keep every Id/Date pair from either table (missing values become NA)
combined_data <- merge(daily_activity, daily_sleep, by = c("Id", "Date"), all = TRUE)
head(combined_data)
## Id Date TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1 327 346
## 2 2 384 407
## 3 NA NA NA
## 4 1 412 442
## 5 2 340 367
## 6 1 700 712
n_distinct(combined_data$Id)
## [1] 33
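The two `merge()` calls differ only in `all = TRUE`, which explains why one result has 24 participants and the other 33. A small sketch with made-up tables (names and values are assumptions for illustration) shows the effect:

```r
# Toy tables (made-up values): three activity rows, two sleep rows
activity <- data.frame(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(5000, 7000, 3000)
)
sleep <- data.frame(
  Id = c(1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(400, 380)
)

# Inner join (merge default): only Id/Date pairs present in both tables
inner <- merge(activity, sleep, by = c("Id", "Date"))

# Full outer join: every pair from either table, NA where sleep data is missing
full <- merge(activity, sleep, by = c("Id", "Date"), all = TRUE)

nrow(inner)
nrow(full)
sum(is.na(full$TotalMinutesAsleep))
```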
sum(is.na(combined_data))
## [1] 1590
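# Replace NA with 0 in numeric columns.
# Caveat: days without a sleep record are then counted as 0 minutes of sleep.
combined_data <- combined_data %>%
  mutate(across(where(is.numeric), ~ replace(., is.na(.), 0)))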
sum(is.na(combined_data))
## [1] 0
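Zero-filling removes the NA values, but it also changes any average computed on the sleep columns afterwards. The sketch below, using three made-up values, contrasts it with keeping the NA and skipping it at summary time:

```r
# Made-up sleep minutes with one missing night
minutes_asleep <- c(420, NA, 480)

# Zero-filling treats the missing night as 0 minutes of sleep
zero_filled <- replace(minutes_asleep, is.na(minutes_asleep), 0)
mean(zero_filled)                   # 300

# Keeping NA and skipping it averages only the recorded nights
mean(minutes_asleep, na.rm = TRUE)  # 450
```

Which option is appropriate depends on the question being asked; for sleep duration, skipping missing nights is probably closer to what is meant.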
# Preview the weekday number of the first rows (0 = Sunday, ..., 6 = Saturday)
head(format(as.Date(combined_data$Date), "%w"))
## [1] "2" "3" "4" "5" "6" "0"
# Derive English day names from the weekday number (0 = Sunday, ..., 6 = Saturday).
# weekdays() returns names in the system locale (Portuguese on this machine),
# which do not match English factor levels and would produce NA.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
combined_data$DayOfTheWeek <- factor(
  day_names[as.integer(format(as.Date(combined_data$Date), "%w")) + 1],
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
)
# Total tracked minutes per day across all activity levels
combined_data$TotalMinutes <- combined_data$VeryActiveMinutes + combined_data$FairlyActiveMinutes + combined_data$LightlyActiveMinutes + combined_data$SedentaryMinutes
str(combined_data)
## 'data.frame': 940 obs. of 20 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ Date : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num 728 776 1218 726 773 ...
## $ Calories : num 1985 1797 1776 1745 1863 ...
## $ TotalSleepRecords : num 1 2 0 1 2 1 0 1 1 1 ...
## $ TotalMinutesAsleep : num 327 384 0 412 340 700 0 304 360 325 ...
## $ TotalTimeInBed : num 346 407 0 442 367 712 0 320 377 364 ...
## $ DayOfTheWeek : Factor w/ 7 levels "Monday","Tuesday",..: 2 3 4 5 6 7 1 2 3 4 ...
## $ TotalMinutes : num 1094 1033 1440 998 1040 ...
combined_data$TotalHours <- round((combined_data$TotalMinutes/60), digits=2)
combined_data %>%
  select(TotalSteps,
         SedentaryMinutes,
         Calories) %>%
  summary()
## TotalSteps SedentaryMinutes Calories
## Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 3790 1st Qu.: 729.8 1st Qu.:1828
## Median : 7406 Median :1057.5 Median :2134
## Mean : 7638 Mean : 991.2 Mean :2304
## 3rd Qu.:10727 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :36019 Max. :1440.0 Max. :4900
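To follow up on the question of which days users are most active, one possible next step is to average the metrics by day type. The sketch below uses only the first six activity rows shown earlier, so its result says nothing about the full dataset; the `DayType` column and the weekday/weekend split are assumptions made for illustration.

```r
# First six activity rows from the previews above (one participant)
sample_activity <- data.frame(
  Date = as.Date("2016-04-12") + 0:5,
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705),
  SedentaryMinutes = c(728, 776, 1218, 726, 773, 539)
)

# "%w" gives the weekday as a number (0 = Sunday, 6 = Saturday), independent of locale
day_number <- as.integer(format(sample_activity$Date, "%w"))
sample_activity$DayType <- ifelse(day_number %in% c(0, 6), "Weekend", "Weekday")

# Average steps and sedentary minutes for each day type
by_day_type <- aggregate(
  cbind(TotalSteps, SedentaryMinutes) ~ DayType,
  data = sample_activity,
  FUN = mean
)
by_day_type
```

On the full data, the same `aggregate()` call could presumably be run on `combined_data` grouped by `DayOfTheWeek` instead.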
#install.packages("ggpubr")
library(ggpubr)
p1 <- ggplot(data = combined_data) +
  geom_point(mapping = aes(x = TotalDistance, y = Calories), color = "yellow") +
  labs(title = "Total Distance vs. Calories")
p2 <- ggplot(data = combined_data) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories), color = "green") +
  labs(title = "Total Steps vs. Calories")
p3 <- ggplot(data = combined_data) +
  geom_point(mapping = aes(x = TotalMinutes, y = Calories), color = "red") +
  labs(title = "Total Minutes vs. Calories")
ggarrange(p1, p2, p3, ncol = 3, nrow = 1)
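The scatter plots show the relationships visually; a correlation coefficient can summarize each one as a single number. The sketch below computes Pearson correlations with `cor()` on the first six activity rows shown earlier. Six rows from one participant are far too few to generalize, so this only illustrates the call.

```r
# First six activity rows from the previews above (one participant)
sample_activity <- data.frame(
  TotalSteps    = c(13162, 10735, 10460, 9762, 12669, 9705),
  TotalDistance = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48),
  Calories      = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# Pearson correlation of each measure with Calories
calorie_cor <- sapply(
  sample_activity[c("TotalSteps", "TotalDistance")],
  cor,
  y = sample_activity$Calories
)
calorie_cor
```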
In the final step, we will be delivering our insights and providing recommendations based on our analysis. Here, we revisit our business questions and share with you our high-level business recommendations.
The majority of users (81.10%) use the Fitbit app to track sedentary activities rather than their health habits. Users prefer to track their activities on weekdays rather than weekends, perhaps because they spend more time outside on weekdays and stay in on weekends. The data also tell us that most users log their calories, steps taken, and so on, while fewer log their sleep data.
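The calculation behind the 81.10% figure is not shown above. One plausible reading, offered here as an assumption, is the share of sedentary minutes among all tracked minutes. The sketch below computes such shares for the first five rows visible in the `str()` output, so the resulting numbers will not match the full-dataset figure.

```r
# First five rows of activity minutes from the str() output above
minutes <- data.frame(
  VeryActiveMinutes    = c(25, 21, 30, 29, 36),
  FairlyActiveMinutes  = c(13, 19, 11, 34, 10),
  LightlyActiveMinutes = c(328, 217, 181, 209, 221),
  SedentaryMinutes     = c(728, 776, 1218, 726, 773)
)

# Total minutes per intensity level, then each level's share of all tracked minutes
totals <- colSums(minutes)
share_pct <- round(100 * totals / sum(totals), 2)
share_pct
```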
Both companies develop products focused on providing women with their health, habit, and fitness data and encouraging them to understand their current habits and make healthy decisions. These common trends in health and fitness can therefore be applied to Bellabeat customers. Bellabeat could market to this type of customer by showing them that smart devices can help them start their journey, measuring how much they move and how these moments of activity could help them live longer.
It is well documented that moderate-to-vigorous physical activity is protective against chronic disease. Conversely, emerging evidence indicates the deleterious effects of prolonged sitting. Changing both behaviors requires support, and self-monitoring is one of the most robust behavior-change techniques available. The Bellabeat marketing team can encourage users by educating them about fitness benefits, suggesting different types of exercise (e.g., a simple 10-minute exercise on weekdays and a more intense exercise on weekends), and providing calorie intake and burn rate information in the Bellabeat app. On weekends, the Bellabeat app can also send notifications to encourage users to exercise. By marketing these devices to consumers, Bellabeat provides a unique opportunity for individuals to change their behavior, become more physically active, and increase their life expectancy.