Bellabeat Case Study
This is the capstone project of the Google Data Analytics Certification.
I chose the second of the two case studies: Bellabeat, a high-tech
manufacturer of health-focused products for women.
What's the case?
Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer,
together with Sando Mur, mathematician, Bellabeat cofounder and key
member of the Bellabeat executive team, and the Bellabeat marketing
analytics team ask me to analyze smart device usage data in order to
gain insight into how consumers use non-Bellabeat smart devices. Once I
have my results, I will select one Bellabeat product to apply these
insights to in my presentation. The products are:
- Bellabeat app: The Bellabeat app provides users with health data
related to their activity, sleep, stress, menstrual cycle, and
mindfulness habits. This data can help users better understand their
current habits and make healthy decisions. The Bellabeat app connects to
their line of smart wellness products.
- Leaf: Bellabeat’s classic wellness tracker can be worn as a
bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat
app to track activity, sleep, and stress.
- Time: This wellness watch combines the timeless look of a classic
timepiece with smart technology to track user activity, sleep, and
stress. The Time watch connects to the Bellabeat app to provide you with
insights into your daily wellness.
- Spring: This is a water bottle that tracks daily water intake using
smart technology to ensure that you are appropriately hydrated
throughout the day. The Spring bottle connects to the Bellabeat app to
track your hydration levels.
- Bellabeat membership: Bellabeat also offers a subscription-based
membership program for users. Membership gives users 24/7 access to
fully personalized guidance on nutrition, activity, sleep, health and
beauty, and mindfulness based on their lifestyle and goals.
Main questions from stakeholders
The scope of this analysis is to answer the following questions:
- What are some trends in smart device usage?
- How could these trends apply to Bellabeat customers?
- How could these trends help influence Bellabeat marketing
strategy?
Requested deliverables
- A clear summary of the business task
- A description of all data sources used
- Documentation of any cleaning or manipulation of data
- A summary of my analysis
- Supporting visualizations and key findings
- My top high-level content recommendations based on my analysis
Data used
The data is the FitBit Fitness Tracker Data (CC0: Public Domain,
dataset made available through Mobius). This Kaggle data set contains
personal fitness tracker data from thirty Fitbit users. Thirty
eligible Fitbit users consented to the submission of personal tracker
data, including minute-level output for physical activity, heart rate,
and sleep monitoring. It includes information about daily activity,
steps, and heart rate that can be used to explore users’ habits.
Not so ROCCC
- Reliability: Low - the sample has only 33 users
- Originality: Low - third-party provider
- Comprehensive: Medium - the data does not include gender
- Cited: Low - third-party data taken from a public source
Fixing a problem with the package repository link
options(repos = list(CRAN="http://cran.rstudio.com/"))
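The chunks that follow call read_csv, rename, clean_names and ggplot without showing where those functions come from. A minimal setup sketch, assuming the tidyverse and janitor packages are the ones used (the package choice is an assumption, the original does not show its library calls):

```r
# Assumed package list: read_csv, rename and ggplot ship with the tidyverse,
# clean_names comes from janitor.
required_pkgs <- c("tidyverse", "janitor")

# Check which packages are available without attaching them yet
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  # Report what is missing instead of failing in a later chunk
  message("Install before knitting: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(tidyverse)
  library(janitor)
}
```

Running this once at the top of the report makes a missing package visible before any data is read.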
Importing daily datasets: examine, clean and organise them
Importing all the necessary datasets. I chose only the daily ones,
because I believe it is in the daily activities that we can find the
trends. After looking at their columns, I decided to rename the
ActivityDate/ActivityDay/SleepDay column to just Date, so I can later
merge them into a single dataframe through the unique user id and the
day of the activity.
Activity <- read_csv("D:/BellaBeat/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Activity)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
sapply(Activity, function(x) length(unique(x)))
## Id ActivityDate TotalSteps
## 33 31 842
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 615 613 19
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 333 211 491
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 9 122 81
## LightlyActiveMinutes SedentaryMinutes Calories
## 335 549 734
Activity %>%
rename(Date=ActivityDate)
## # A tibble: 940 × 15
## Id Date Total…¹ Total…² Track…³ Logge…⁴ VeryA…⁵ Moder…⁶ Light…⁷ Seden…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## 7 1.50e9 4/18… 13019 8.59 8.59 0 3.25 0.640 4.71 0
## 8 1.50e9 4/19… 15506 9.88 9.88 0 3.53 1.32 5.03 0
## 9 1.50e9 4/20… 10544 6.68 6.68 0 1.96 0.480 4.24 0
## 10 1.50e9 4/21… 9819 6.34 6.34 0 1.34 0.350 4.65 0
## # … with 930 more rows, 5 more variables: VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>, and abbreviated variable names
## # ¹TotalSteps, ²TotalDistance, ³TrackerDistance, ⁴LoggedActivitiesDistance,
## # ⁵VeryActiveDistance, ⁶ModeratelyActiveDistance, ⁷LightActiveDistance,
## # ⁸SedentaryActiveDistance
Calories <- read_csv("D:/BellaBeat/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Calories)
## # A tibble: 6 × 3
## Id ActivityDay Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
Calories %>%
rename(Date=ActivityDay)
## # A tibble: 940 × 3
## Id Date Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
## 7 1503960366 4/18/2016 1921
## 8 1503960366 4/19/2016 2035
## 9 1503960366 4/20/2016 1786
## 10 1503960366 4/21/2016 1775
## # … with 930 more rows
Steps <-read_csv ("D:/BellaBeat/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Steps)
## # A tibble: 6 × 3
## Id ActivityDay StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
Steps %>%
rename(Date=ActivityDay)
## # A tibble: 940 × 3
## Id Date StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
## 7 1503960366 4/18/2016 13019
## 8 1503960366 4/19/2016 15506
## 9 1503960366 4/20/2016 10544
## 10 1503960366 4/21/2016 9819
## # … with 930 more rows
Sleep <-read_csv ("D:/BellaBeat/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head (Sleep)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalT…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## # … with abbreviated variable name ¹TotalTimeInBed
Sleep %>%
rename(Date=SleepDay)
## # A tibble: 413 × 5
## Id Date TotalSleepRecords TotalMinutesAsleep Total…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
## # … with 403 more rows, and abbreviated variable name ¹TotalTimeInBed
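One detail about the rename calls above: a rename returns a modified copy and leaves the original object untouched, so a call that is only printed does not persist. That is why later chunks still see ActivityDate, ActivityDay and SleepDay. A small base R sketch of the behaviour, using a toy data frame built from the first rows printed above and a hypothetical helper rename_col that stands in for dplyr's rename:

```r
# Toy stand-in for the first rows of dailyCalories_merged.csv shown above
calories <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985, 1797, 1776)
)

# Hypothetical helper: rename one column and return the modified copy,
# mirroring how Calories %>% rename(Date = ActivityDay) behaves
rename_col <- function(df, old, new) {
  names(df)[names(df) == old] <- new
  df
}

# Without assignment the renamed copy is printed and then discarded
rename_col(calories, "ActivityDay", "Date")
names(calories)          # the original still has ActivityDay

# Assigning the result keeps the new column name
calories_renamed <- rename_col(calories, "ActivityDay", "Date")
names(calories_renamed)
```

With dplyr the equivalent would be an assignment such as Calories <- Calories %>% rename(Date = ActivityDay); any later code would then need to refer to Date instead of ActivityDay.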
Cleaning Daily Datasets
The next step is to examine, format, and clean the datasets for
inconsistencies in their inputs. For each dataset I first check the
data: I look at the column names, clean the dataset, and change the
format of the date from character to POSIXct, so that all dates have
the same type. I also make sure that the column names are consistent by
using the clean_names command. Finally, I check how many unique users
participated in this study.
head(Activity)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
sapply(Activity, function(x) length(unique(x)))
## Id ActivityDate TotalSteps
## 33 31 842
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 615 613 19
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 333 211 491
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 9 122 81
## LightlyActiveMinutes SedentaryMinutes Calories
## 335 549 734
colnames(Activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
clean_names(Activity)
## # A tibble: 940 × 15
## id activity…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5 0 1.88 0.550 6.06
## 2 1503960366 4/13/2016 10735 6.97 6.97 0 1.57 0.690 4.71
## 3 1503960366 4/14/2016 10460 6.74 6.74 0 2.44 0.400 3.91
## 4 1503960366 4/15/2016 9762 6.28 6.28 0 2.14 1.26 2.83
## 5 1503960366 4/16/2016 12669 8.16 8.16 0 2.71 0.410 5.04
## 6 1503960366 4/17/2016 9705 6.48 6.48 0 3.19 0.780 2.51
## 7 1503960366 4/18/2016 13019 8.59 8.59 0 3.25 0.640 4.71
## 8 1503960366 4/19/2016 15506 9.88 9.88 0 3.53 1.32 5.03
## 9 1503960366 4/20/2016 10544 6.68 6.68 0 1.96 0.480 4.24
## 10 1503960366 4/21/2016 9819 6.34 6.34 0 1.34 0.350 4.65
## # … with 930 more rows, 6 more variables: sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## # abbreviated variable names ¹activity_date, ²total_steps, ³total_distance,
## # ⁴tracker_distance, ⁵logged_activities_distance, ⁶very_active_distance,
## # ⁷moderately_active_distance, ⁸light_active_distance
Activity %>%
rename(Date=ActivityDate)
## # A tibble: 940 × 15
## Id Date Total…¹ Total…² Track…³ Logge…⁴ VeryA…⁵ Moder…⁶ Light…⁷ Seden…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## 7 1.50e9 4/18… 13019 8.59 8.59 0 3.25 0.640 4.71 0
## 8 1.50e9 4/19… 15506 9.88 9.88 0 3.53 1.32 5.03 0
## 9 1.50e9 4/20… 10544 6.68 6.68 0 1.96 0.480 4.24 0
## 10 1.50e9 4/21… 9819 6.34 6.34 0 1.34 0.350 4.65 0
## # … with 930 more rows, 5 more variables: VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>, and abbreviated variable names
## # ¹TotalSteps, ²TotalDistance, ³TrackerDistance, ⁴LoggedActivitiesDistance,
## # ⁵VeryActiveDistance, ⁶ModeratelyActiveDistance, ⁷LightActiveDistance,
## # ⁸SedentaryActiveDistance
Activity$ActivityDate=as.POSIXct(Activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
head(Calories)
## # A tibble: 6 × 3
## Id ActivityDay Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
sapply(Calories, function(x) length(unique(x)))
## Id ActivityDay Calories
## 33 31 734
colnames(Calories)
## [1] "Id" "ActivityDay" "Calories"
clean_names(Calories)
## # A tibble: 940 × 3
## id activity_day calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
## 7 1503960366 4/18/2016 1921
## 8 1503960366 4/19/2016 2035
## 9 1503960366 4/20/2016 1786
## 10 1503960366 4/21/2016 1775
## # … with 930 more rows
Calories %>%
rename(Date=ActivityDay)
## # A tibble: 940 × 3
## Id Date Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
## 7 1503960366 4/18/2016 1921
## 8 1503960366 4/19/2016 2035
## 9 1503960366 4/20/2016 1786
## 10 1503960366 4/21/2016 1775
## # … with 930 more rows
Calories$ActivityDay=as.POSIXct(Calories$ActivityDay, format="%m/%d/%Y", tz=Sys.timezone())
head(Steps)
## # A tibble: 6 × 3
## Id ActivityDay StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
sapply(Steps, function(x) length(unique(x)))
## Id ActivityDay StepTotal
## 33 31 842
colnames(Steps)
## [1] "Id" "ActivityDay" "StepTotal"
clean_names(Steps)
## # A tibble: 940 × 3
## id activity_day step_total
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
## 7 1503960366 4/18/2016 13019
## 8 1503960366 4/19/2016 15506
## 9 1503960366 4/20/2016 10544
## 10 1503960366 4/21/2016 9819
## # … with 930 more rows
Steps$ActivityDay=as.POSIXct(Steps$ActivityDay, format="%m/%d/%Y", tz=Sys.timezone())
Steps %>%
rename(Date=ActivityDay)
## # A tibble: 940 × 3
## Id Date StepTotal
## <dbl> <dttm> <dbl>
## 1 1503960366 2016-04-12 00:00:00 13162
## 2 1503960366 2016-04-13 00:00:00 10735
## 3 1503960366 2016-04-14 00:00:00 10460
## 4 1503960366 2016-04-15 00:00:00 9762
## 5 1503960366 2016-04-16 00:00:00 12669
## 6 1503960366 2016-04-17 00:00:00 9705
## 7 1503960366 2016-04-18 00:00:00 13019
## 8 1503960366 2016-04-19 00:00:00 15506
## 9 1503960366 2016-04-20 00:00:00 10544
## 10 1503960366 2016-04-21 00:00:00 9819
## # … with 930 more rows
head(Sleep)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalT…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## # … with abbreviated variable name ¹TotalTimeInBed
sapply(Sleep, function(x) length(unique(x)))
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 24 31 3 256
## TotalTimeInBed
## 242
colnames(Sleep)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
clean_names(Sleep)
## # A tibble: 413 × 5
## id sleep_day total_sleep_records total_minutes_…¹ total…²
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
## # … with 403 more rows, and abbreviated variable names ¹total_minutes_asleep,
## # ²total_time_in_bed
Sleep %>%
rename(Date=SleepDay)
## # A tibble: 413 × 5
## Id Date TotalSleepRecords TotalMinutesAsleep Total…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
## # … with 403 more rows, and abbreviated variable name ¹TotalTimeInBed
Sleep$SleepDay=as.POSIXct(Sleep$SleepDay, format="%m/%d/%Y", tz=Sys.timezone())
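Since every SleepDay value printed above carries the same 12:00:00 AM time part, the date alone is what matters for joining the tables. A possible alternative to as.POSIXct, sketched here on a few values copied from the output above, is to parse straight to the Date class, which avoids any dependence on Sys.timezone():

```r
# Sample values as they appear in the activity and sleep files
activity_day <- c("4/12/2016", "4/13/2016", "4/15/2016")
sleep_day <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/15/2016 12:00:00 AM")

# as.Date() reads only as much of the string as the format describes,
# so the trailing time part of SleepDay is ignored
activity_date <- as.Date(activity_day, format = "%m/%d/%Y")
sleep_date <- as.Date(sleep_day, format = "%m/%d/%Y")

# Both columns now hold comparable calendar dates
all(activity_date == sleep_date)
format(activity_date)
```

Whether Date or POSIXct is preferable depends on the later analysis; for daily tables that are merged by day, Date is usually enough.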
First conclusions
From the cleaning I found out that participation is not the same
across all datasets; some participants are missing. The Activity,
Calories, and Steps datasets have 33 unique ids, but Sleep has only 24.
Something very important to notice here is that the sample is very
small for a real data analysis. We are also already missing entries, so
in real life it would not be recommended as a good data source to work
with. Another very important issue is that the data has no gender
information. The data is also out of date, since it is from 2016.
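The number of users per dataset, and which users are missing from one of them, can be checked directly. A minimal sketch with made-up ids and values (the data frames activity and sleep below are hypothetical stand-ins, not the real files):

```r
# Hypothetical toy tables: four users with activity, only two with sleep data
activity <- data.frame(
  Id = c(101, 101, 102, 103, 104),
  TotalSteps = c(9000, 7500, 4000, 12000, 3000)
)
sleep <- data.frame(
  Id = c(101, 102, 102),
  TotalMinutesAsleep = c(420, 380, 400)
)

# Unique users per dataset
n_users <- c(
  activity = length(unique(activity$Id)),
  sleep = length(unique(sleep$Id))
)
n_users

# Users who recorded activity but never recorded sleep
missing_from_sleep <- setdiff(unique(activity$Id), unique(sleep$Id))
missing_from_sleep
```

Applied to the real tables, the same two steps would show how many users each file covers and which ids drop out when the sleep data is joined in.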
Analysing the Data
First I will look at summary statistics for some important activity
columns:
Activity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes,
VeryActiveMinutes,
Calories) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes VeryActiveMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.: 0.00
## Median : 7406 Median : 5.245 Median :1057.5 Median : 4.00
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean : 21.16
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.: 32.00
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :210.00
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
The average total steps are 7638 and the average distance is 5.490.
These numbers are below the usual standards, such as the commonly cited
goal of 10,000 steps a day.
Sedentary minutes average 991.2, which divided by 60 is about 16.5
hours of inactivity. That tells me that the participants are not very
active, or that they didn’t wear their smart device.
Very active time averaged 21.16 minutes per day. The average of
calories burned was 2304.
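The conversion from minutes to hours can be written out explicitly. The two means are taken from the summary() output above; the variable names are only illustrative:

```r
# Means copied from the summary() output above
mean_sedentary_min <- 991.2
mean_very_active_min <- 21.16

# Minutes to hours, and share of a 24-hour day
sedentary_hours <- mean_sedentary_min / 60
sedentary_share <- mean_sedentary_min / (24 * 60)

round(sedentary_hours, 2)      # hours per day spent sedentary
round(sedentary_share, 3)      # fraction of the day spent sedentary

# Ratio of sedentary time to very active time
round(mean_sedentary_min / mean_very_active_min, 1)
```

Expressing the sedentary time as a share of the day makes it easier to compare with the very active minutes.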
Then I examine some data about sleep:
Sleep %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
The average total time asleep is 419 minutes (almost 7 hours). This
number shows that overall the wearers had enough sleep. We can see this
by separating them into 3 categories:
sleepcategories <- Sleep %>%
group_by(Id) %>%
summarise(avg_time_asleep = mean(TotalMinutesAsleep)) %>%
mutate(type=case_when (
avg_time_asleep < 300 ~ "need more sleep",
avg_time_asleep >=300 & avg_time_asleep <= 420 ~ "average sleep",
avg_time_asleep > 420 ~ "enough sleep"))
sleepcategories
## # A tibble: 24 × 3
## Id avg_time_asleep type
## <dbl> <dbl> <chr>
## 1 1503960366 360. average sleep
## 2 1644430081 294 need more sleep
## 3 1844505072 652 enough sleep
## 4 1927972279 417 average sleep
## 5 2026352035 506. enough sleep
## 6 2320127002 61 need more sleep
## 7 2347167796 447. enough sleep
## 8 3977333714 294. need more sleep
## 9 4020332650 349. average sleep
## 10 4319703577 477. enough sleep
## # … with 14 more rows
n_Average_sleepers <- sum(sleepcategories$type == 'average sleep')
n_Average_sleepers
## [1] 6
n_Need_more_sleep <- sum(sleepcategories$type == 'need more sleep')
n_Need_more_sleep
## [1] 6
n_Enough_sleep <- sum(sleepcategories$type == 'enough sleep')
n_Enough_sleep
## [1] 12
From the above we see that people with a healthy sleep pattern make up
half of the sample (12 of 24), while only a quarter (6 of 24) need more
sleep.
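The three counts printed above (6, 6 and 12 out of 24 users) can be turned into shares in one step. A short sketch, with the counts typed in by hand rather than recomputed from the data:

```r
# Counts per category as printed above
sleep_counts <- c(
  "need more sleep" = 6,
  "average sleep" = 6,
  "enough sleep" = 12
)

# Share of users in each category
sleep_share <- sleep_counts / sum(sleep_counts)
sleep_share

# The same as percentages
round(100 * sleep_share)
```

On the real data, table(sleepcategories$type) followed by prop.table() would give the same shares without retyping the counts.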
Steps %>%
select(StepTotal) %>%
summary()
## StepTotal
## Min. : 0
## 1st Qu.: 3790
## Median : 7406
## Mean : 7638
## 3rd Qu.:10727
## Max. :36019
In the Steps dataset we can clearly see that the users in the sample
are not great walkers.
Calories %>%
select(Calories,
ActivityDay) %>%
summary()
## Calories ActivityDay
## Min. : 0 Min. :2016-04-12 00:00:00.00
## 1st Qu.:1828 1st Qu.:2016-04-19 00:00:00.00
## Median :2134 Median :2016-04-26 00:00:00.00
## Mean :2304 Mean :2016-04-26 06:53:37.01
## 3rd Qu.:2793 3rd Qu.:2016-05-04 00:00:00.00
## Max. :4900 Max. :2016-05-12 00:00:00.00
For calories, we have an average of 2304, which is close to the NHS
reference values for an average adult (about 2,000 kcal a day for women
and 2,500 kcal for men).
Then I merge everything into a bigger dataset that includes all the
categories we need to find some more trends.
merged_data <- merge( Activity, Steps, by = c('Id'))
head(merged_data)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-12 13162 8.5 8.5
## 3 1503960366 2016-04-12 13162 8.5 8.5
## 4 1503960366 2016-04-12 13162 8.5 8.5
## 5 1503960366 2016-04-12 13162 8.5 8.5
## 6 1503960366 2016-04-12 13162 8.5 8.5
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.88 0.55
## 3 0 1.88 0.55
## 4 0 1.88 0.55
## 5 0 1.88 0.55
## 6 0 1.88 0.55
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 6.06 0 25
## 3 6.06 0 25
## 4 6.06 0 25
## 5 6.06 0 25
## 6 6.06 0 25
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 13 328 728 1985
## 3 13 328 728 1985
## 4 13 328 728 1985
## 5 13 328 728 1985
## 6 13 328 728 1985
## ActivityDay StepTotal
## 1 2016-04-12 13162
## 2 2016-04-13 10735
## 3 2016-04-14 10460
## 4 2016-04-15 9762
## 5 2016-04-16 12669
## 6 2016-04-17 9705
merged_data_all<-merge(merged_data, Sleep, by=c('Id'))
head(merged_data_all)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-05-09 12022 7.72 7.72
## 2 1503960366 2016-05-09 12022 7.72 7.72
## 3 1503960366 2016-05-09 12022 7.72 7.72
## 4 1503960366 2016-05-09 12022 7.72 7.72
## 5 1503960366 2016-05-09 12022 7.72 7.72
## 6 1503960366 2016-05-09 12022 7.72 7.72
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 3.45 0.53
## 2 0 3.45 0.53
## 3 0 3.45 0.53
## 4 0 3.45 0.53
## 5 0 3.45 0.53
## 6 0 3.45 0.53
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 3.74 0 46
## 2 3.74 0 46
## 3 3.74 0 46
## 4 3.74 0 46
## 5 3.74 0 46
## 6 3.74 0 46
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 11 206 835 1819
## 2 11 206 835 1819
## 3 11 206 835 1819
## 4 11 206 835 1819
## 5 11 206 835 1819
## 6 11 206 835 1819
## ActivityDay StepTotal SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 2016-04-27 18134 2016-04-12 1 327
## 2 2016-04-27 18134 2016-04-13 2 384
## 3 2016-04-27 18134 2016-04-15 1 412
## 4 2016-04-27 18134 2016-04-16 2 340
## 5 2016-04-27 18134 2016-04-17 1 700
## 6 2016-04-27 18134 2016-04-19 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
In the dataset merged_data_all I have all the information I want.
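One thing to watch in the head() outputs above: because the merge key is only Id, every activity day of a user is paired with every step day and every sleep day of that user (the same ActivityDate repeats next to different ActivityDay and SleepDay values). Merging on both the id and the day keeps one row per user and day. A minimal sketch on made-up toy tables (all ids and values below are hypothetical):

```r
# Hypothetical toy tables: two users, two activity days each
activity <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13",
                           "2016-04-12", "2016-04-13")),
  TotalSteps = c(13162, 10735, 5000, 6000)
)
sleep <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 384, 400)
)

# Key is Id only: each activity row is combined with every sleep row of that user
by_id <- merge(activity, sleep, by = "Id")

# Key is Id plus the day: only matching user-days are kept
by_id_and_day <- merge(
  activity, sleep,
  by.x = c("Id", "ActivityDate"),
  by.y = c("Id", "SleepDay")
)

nrow(by_id)           # 2 * 2 + 2 * 1 rows
nrow(by_id_and_day)   # one row per user-day present in both tables
```

With a day-level key, a plot of sleep against activity compares the same day of the same user, so trends read from a table merged by Id alone should be treated with caution.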
Visual Trends
By comparing some data from the merged dataset we can see that:
caloriesxsteps <- ggplot(data = Activity, aes(x = TotalSteps, y = Calories)) +
  geom_point(color = "pink") +
  geom_smooth() +
  labs(title = "Calories Burned vs Total Daily Steps", x = "Total Steps", y = "Calories Burned")
caloriesxsteps
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is a positive correlation between steps and calories: more
steps, more calories burned.
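The strength of this relationship can also be put into a number with cor(). The sketch below uses only the six rows printed by head(Activity) and head(Calories) above (one user, 4/12 to 4/17), so it illustrates the call rather than the result for the full dataset:

```r
# First six days of user 1503960366, copied from the head() outputs above
total_steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728)

# Pearson correlation: strength of the linear relationship
r_pearson <- cor(total_steps, calories)

# Spearman correlation: based on ranks, less sensitive to outliers
r_spearman <- cor(total_steps, calories, method = "spearman")

r_pearson
r_spearman
```

On the full table the call would be cor(Activity$TotalSteps, Activity$Calories); days with zero steps may be worth inspecting first, since they can come from the device not being worn.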
activityxsleep <- ggplot(data = merged_data_all, aes(x = VeryActiveMinutes, y = TotalMinutesAsleep)) +
  geom_smooth() +
  labs(title = "Activity vs Sleep", x = "Activity", y = "sleep")
activityxsleep
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The more active a person's day is, the more sleep they seem to need.
Sleepxcalories <- ggplot(data = merged_data_all, aes(x = TotalMinutesAsleep, y = Calories)) +
  geom_smooth() +
  labs(title = "Sleep and Calories", x = "Sleep", y = "Burning Calories")
Sleepxcalories
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The plot suggests that users who sleep more also burn more calories.
Recommendations based on the analysis
I would suggest the following for the Bellabeat app:
- Based on the analysis, the users should be more active. They don’t
walk much, or they don’t wear their smart device to record their
activities. So I would suggest that the Bellabeat app have some
reminders about mobility. Set a minimum target of about 30 minutes of
activity each day, or a step target (10000). When users reach this
goal, they can earn a badge or something rewarding, so they keep doing
this each day.
- Sleep notifications together with stress-reducing exercises on their
mobile phone, for example breathing exercises or some kind of
meditation.
- The users are watching their activities, but because we have missing
records and a lot of inactive time, maybe there should be a way to
record the data to the app automatically through their other smart
devices, without depending on manual entries by the users.
Answers to the questions:
- What are some trends in smart device usage?
People use smart devices to record their everyday activity, but
they don’t necessarily record it every day.
From the sample we can see that smart devices need to notify
users so they can stay on track.
If people track their records, they are more likely to continue
their healthy habits.
- How could these trends apply to Bellabeat customers? Tracking
everyday habits helps people improve their way of living. Having better
sleep, exercising more and watching their progress is a huge incentive
for someone to keep working on their well-being.
- How could these trends help influence Bellabeat marketing strategy?
Bellabeat should continue to produce high-quality smart devices. They
should also make an effort to make their app an everyday necessity for
the user, by notifying and informing them about their progress and even
rewarding them when they are doing great!