Bellabeat Case Study

This is the capstone project of the Google Data Analytics Certification. I chose the second of the two case studies: Bellabeat, a high-tech manufacturer of health-focused products for women.

What's the case?

Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer, together with Sando Mur, mathematician, Bellabeat cofounder and key member of the Bellabeat executive team, and the Bellabeat marketing analytics team ask me to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. Once I have my results, I will select one Bellabeat product to apply these insights to in my presentation. The products are:

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Main questions from stakeholders

The scope of this analysis is to answer the following questions:

1. What are some trends in smart device usage?
1. How could these trends apply to Bellabeat customers?
1. How could these trends help influence Bellabeat marketing strategy?

Deliverables

1. A clear summary of the business task
1. A description of all data sources used
1. Documentation of any cleaning or manipulation of data
1. A summary of my analysis
1. Supporting visualizations and key findings
1. My top high-level content recommendations based on my analysis

Data Used

The data is the FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius). This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Not so ROCCC

Reliability: Low - small sample of only 33 people
Originality: Low - third-party provider
Comprehensive: Medium - the data does not include gender
Cited: Low - third-party data taken from a public source

Fixing a problem with the package repository link

options(repos = list(CRAN="http://cran.rstudio.com/"))

Importing the necessary packages for cleaning, formatting and analyzing data

install.packages('tidyverse', repos = "https://ftp.cc.uoc.gr/mirrors/CRAN/")

## Installing package into 'C:/Users/Costa/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)

## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Costa\AppData\Local\Temp\RtmpK0vZBP\downloaded_packages

library (tidyverse)

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tidyr)
library(dplyr)
library(ggplot2)
install.packages("janitor")

## Installing package into 'C:/Users/Costa/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)

## package 'janitor' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Costa\AppData\Local\Temp\RtmpK0vZBP\downloaded_packages

library (janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Importing the daily datasets: examining, cleaning and organising them

Importing all the necessary datasets. I chose only the daily ones, because I believe it is in the daily activities that we can find the trends. I looked at their columns and decided to rename the ActivityDate / ActivityDay / SleepDay column to just Date, so that I can later merge them into a single dataframe by unique user id and day of activity.

Activity <- read_csv("D:/BellaBeat/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Activity)

## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## #   ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## #   ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance

sapply(Activity, function(x) length(unique(x)))

##                       Id             ActivityDate               TotalSteps 
##                       33                       31                      842 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                      615                      613                       19 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                      333                      211                      491 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        9                      122                       81 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                      335                      549                      734

Activity %>%
  rename(Date=ActivityDate)

## # A tibble: 940 × 15
##         Id Date  Total…¹ Total…² Track…³ Logge…⁴ VeryA…⁵ Moder…⁶ Light…⁷ Seden…⁸
##      <dbl> <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  1.50e9 4/12…   13162    8.5     8.5        0    1.88   0.550    6.06       0
##  2  1.50e9 4/13…   10735    6.97    6.97       0    1.57   0.690    4.71       0
##  3  1.50e9 4/14…   10460    6.74    6.74       0    2.44   0.400    3.91       0
##  4  1.50e9 4/15…    9762    6.28    6.28       0    2.14   1.26     2.83       0
##  5  1.50e9 4/16…   12669    8.16    8.16       0    2.71   0.410    5.04       0
##  6  1.50e9 4/17…    9705    6.48    6.48       0    3.19   0.780    2.51       0
##  7  1.50e9 4/18…   13019    8.59    8.59       0    3.25   0.640    4.71       0
##  8  1.50e9 4/19…   15506    9.88    9.88       0    3.53   1.32     5.03       0
##  9  1.50e9 4/20…   10544    6.68    6.68       0    1.96   0.480    4.24       0
## 10  1.50e9 4/21…    9819    6.34    6.34       0    1.34   0.350    4.65       0
## # … with 930 more rows, 5 more variables: VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>, and abbreviated variable names
## #   ¹TotalSteps, ²TotalDistance, ³TrackerDistance, ⁴LoggedActivitiesDistance,
## #   ⁵VeryActiveDistance, ⁶ModeratelyActiveDistance, ⁷LightActiveDistance,
## #   ⁸SedentaryActiveDistance

Calories <- read_csv("D:/BellaBeat/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Calories)

## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728

Calories %>%
  rename(Date=ActivityDay)

## # A tibble: 940 × 3
##            Id Date      Calories
##         <dbl> <chr>        <dbl>
##  1 1503960366 4/12/2016     1985
##  2 1503960366 4/13/2016     1797
##  3 1503960366 4/14/2016     1776
##  4 1503960366 4/15/2016     1745
##  5 1503960366 4/16/2016     1863
##  6 1503960366 4/17/2016     1728
##  7 1503960366 4/18/2016     1921
##  8 1503960366 4/19/2016     2035
##  9 1503960366 4/20/2016     1786
## 10 1503960366 4/21/2016     1775
## # … with 930 more rows

Steps <-read_csv ("D:/BellaBeat/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Steps)

## # A tibble: 6 × 3
##           Id ActivityDay StepTotal
##        <dbl> <chr>           <dbl>
## 1 1503960366 4/12/2016       13162
## 2 1503960366 4/13/2016       10735
## 3 1503960366 4/14/2016       10460
## 4 1503960366 4/15/2016        9762
## 5 1503960366 4/16/2016       12669
## 6 1503960366 4/17/2016        9705

Steps %>%
  rename(Date=ActivityDay)

## # A tibble: 940 × 3
##            Id Date      StepTotal
##         <dbl> <chr>         <dbl>
##  1 1503960366 4/12/2016     13162
##  2 1503960366 4/13/2016     10735
##  3 1503960366 4/14/2016     10460
##  4 1503960366 4/15/2016      9762
##  5 1503960366 4/16/2016     12669
##  6 1503960366 4/17/2016      9705
##  7 1503960366 4/18/2016     13019
##  8 1503960366 4/19/2016     15506
##  9 1503960366 4/20/2016     10544
## 10 1503960366 4/21/2016      9819
## # … with 930 more rows

Sleep <-read_csv ("D:/BellaBeat/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head (Sleep)

## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹TotalTimeInBed

Sleep %>%
  rename(Date=SleepDay)

## # A tibble: 413 × 5
##            Id Date                  TotalSleepRecords TotalMinutesAsleep Total…¹
##         <dbl> <chr>                             <dbl>              <dbl>   <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                 1                327     346
##  2 1503960366 4/13/2016 12:00:00 AM                 2                384     407
##  3 1503960366 4/15/2016 12:00:00 AM                 1                412     442
##  4 1503960366 4/16/2016 12:00:00 AM                 2                340     367
##  5 1503960366 4/17/2016 12:00:00 AM                 1                700     712
##  6 1503960366 4/19/2016 12:00:00 AM                 1                304     320
##  7 1503960366 4/20/2016 12:00:00 AM                 1                360     377
##  8 1503960366 4/21/2016 12:00:00 AM                 1                325     364
##  9 1503960366 4/23/2016 12:00:00 AM                 1                361     384
## 10 1503960366 4/24/2016 12:00:00 AM                 1                430     449
## # … with 403 more rows, and abbreviated variable name ¹TotalTimeInBed
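One detail worth noting about the calls above: rename() returns a new table and leaves the original untouched, so piping into it without an assignment only prints the result. The sketch below illustrates this on a two-row stand-in for the Calories table. The object names calories_demo and calories_renamed are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for the Calories table (values copied from the head() output above)
calories_demo <- data.frame(
  Id = c(1503960366, 1503960366),
  ActivityDay = c("4/12/2016", "4/13/2016"),
  Calories = c(1985, 1797),
  stringsAsFactors = FALSE
)

# Without an assignment the renamed table is returned and then discarded
calories_demo %>% rename(Date = ActivityDay)
names(calories_demo)     # still has "ActivityDay"

# Assigning the result keeps the new column name
calories_renamed <- calories_demo %>% rename(Date = ActivityDay)
names(calories_renamed)  # now has "Date"
```

Whether to overwrite the original object or store the result under a new name is a matter of taste; a new name keeps the raw import available for comparison.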

Cleaning Daily Datasets

The next step is to examine, format and clean the datasets to remove inconsistencies in their values. For each dataset I check the data, look at the column names, standardise the names with the clean_names() command, and convert the date column from character to the POSIXct date format so that every dataset stores dates the same way. I also count how many unique users appear in each dataset.

head(Activity)

## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## #   ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## #   ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance

sapply(Activity, function(x) length(unique(x)))

##                       Id             ActivityDate               TotalSteps 
##                       33                       31                      842 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                      615                      613                       19 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                      333                      211                      491 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        9                      122                       81 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                      335                      549                      734

colnames(Activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

clean_names(Activity)

## # A tibble: 940 × 15
##            id activity…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸
##         <dbl> <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1503960366 4/12/2016    13162    8.5     8.5        0    1.88   0.550    6.06
##  2 1503960366 4/13/2016    10735    6.97    6.97       0    1.57   0.690    4.71
##  3 1503960366 4/14/2016    10460    6.74    6.74       0    2.44   0.400    3.91
##  4 1503960366 4/15/2016     9762    6.28    6.28       0    2.14   1.26     2.83
##  5 1503960366 4/16/2016    12669    8.16    8.16       0    2.71   0.410    5.04
##  6 1503960366 4/17/2016     9705    6.48    6.48       0    3.19   0.780    2.51
##  7 1503960366 4/18/2016    13019    8.59    8.59       0    3.25   0.640    4.71
##  8 1503960366 4/19/2016    15506    9.88    9.88       0    3.53   1.32     5.03
##  9 1503960366 4/20/2016    10544    6.68    6.68       0    1.96   0.480    4.24
## 10 1503960366 4/21/2016     9819    6.34    6.34       0    1.34   0.350    4.65
## # … with 930 more rows, 6 more variables: sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## #   abbreviated variable names ¹activity_date, ²total_steps, ³total_distance,
## #   ⁴tracker_distance, ⁵logged_activities_distance, ⁶very_active_distance,
## #   ⁷moderately_active_distance, ⁸light_active_distance

Activity %>%
  rename(Date=ActivityDate)

## # A tibble: 940 × 15
##         Id Date  Total…¹ Total…² Track…³ Logge…⁴ VeryA…⁵ Moder…⁶ Light…⁷ Seden…⁸
##      <dbl> <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  1.50e9 4/12…   13162    8.5     8.5        0    1.88   0.550    6.06       0
##  2  1.50e9 4/13…   10735    6.97    6.97       0    1.57   0.690    4.71       0
##  3  1.50e9 4/14…   10460    6.74    6.74       0    2.44   0.400    3.91       0
##  4  1.50e9 4/15…    9762    6.28    6.28       0    2.14   1.26     2.83       0
##  5  1.50e9 4/16…   12669    8.16    8.16       0    2.71   0.410    5.04       0
##  6  1.50e9 4/17…    9705    6.48    6.48       0    3.19   0.780    2.51       0
##  7  1.50e9 4/18…   13019    8.59    8.59       0    3.25   0.640    4.71       0
##  8  1.50e9 4/19…   15506    9.88    9.88       0    3.53   1.32     5.03       0
##  9  1.50e9 4/20…   10544    6.68    6.68       0    1.96   0.480    4.24       0
## 10  1.50e9 4/21…    9819    6.34    6.34       0    1.34   0.350    4.65       0
## # … with 930 more rows, 5 more variables: VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>, and abbreviated variable names
## #   ¹TotalSteps, ²TotalDistance, ³TrackerDistance, ⁴LoggedActivitiesDistance,
## #   ⁵VeryActiveDistance, ⁶ModeratelyActiveDistance, ⁷LightActiveDistance,
## #   ⁸SedentaryActiveDistance

Activity$ActivityDate=as.POSIXct(Activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())

head(Calories)

## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728

sapply(Calories, function(x) length(unique(x)))

##          Id ActivityDay    Calories 
##          33          31         734

colnames(Calories)

## [1] "Id"          "ActivityDay" "Calories"

clean_names(Calories)

## # A tibble: 940 × 3
##            id activity_day calories
##         <dbl> <chr>           <dbl>
##  1 1503960366 4/12/2016        1985
##  2 1503960366 4/13/2016        1797
##  3 1503960366 4/14/2016        1776
##  4 1503960366 4/15/2016        1745
##  5 1503960366 4/16/2016        1863
##  6 1503960366 4/17/2016        1728
##  7 1503960366 4/18/2016        1921
##  8 1503960366 4/19/2016        2035
##  9 1503960366 4/20/2016        1786
## 10 1503960366 4/21/2016        1775
## # … with 930 more rows

Calories %>%
  rename(Date=ActivityDay)

## # A tibble: 940 × 3
##            Id Date      Calories
##         <dbl> <chr>        <dbl>
##  1 1503960366 4/12/2016     1985
##  2 1503960366 4/13/2016     1797
##  3 1503960366 4/14/2016     1776
##  4 1503960366 4/15/2016     1745
##  5 1503960366 4/16/2016     1863
##  6 1503960366 4/17/2016     1728
##  7 1503960366 4/18/2016     1921
##  8 1503960366 4/19/2016     2035
##  9 1503960366 4/20/2016     1786
## 10 1503960366 4/21/2016     1775
## # … with 930 more rows

Calories$ActivityDay=as.POSIXct(Calories$ActivityDay, format="%m/%d/%Y", tz=Sys.timezone())

head(Steps)

## # A tibble: 6 × 3
##           Id ActivityDay StepTotal
##        <dbl> <chr>           <dbl>
## 1 1503960366 4/12/2016       13162
## 2 1503960366 4/13/2016       10735
## 3 1503960366 4/14/2016       10460
## 4 1503960366 4/15/2016        9762
## 5 1503960366 4/16/2016       12669
## 6 1503960366 4/17/2016        9705

sapply(Steps, function(x) length(unique(x)))

##          Id ActivityDay   StepTotal 
##          33          31         842

colnames(Steps)

## [1] "Id"          "ActivityDay" "StepTotal"

clean_names(Steps)

## # A tibble: 940 × 3
##            id activity_day step_total
##         <dbl> <chr>             <dbl>
##  1 1503960366 4/12/2016         13162
##  2 1503960366 4/13/2016         10735
##  3 1503960366 4/14/2016         10460
##  4 1503960366 4/15/2016          9762
##  5 1503960366 4/16/2016         12669
##  6 1503960366 4/17/2016          9705
##  7 1503960366 4/18/2016         13019
##  8 1503960366 4/19/2016         15506
##  9 1503960366 4/20/2016         10544
## 10 1503960366 4/21/2016          9819
## # … with 930 more rows

Steps$ActivityDay=as.POSIXct(Steps$ActivityDay, format="%m/%d/%Y", tz=Sys.timezone())
Steps %>%
  rename(Date=ActivityDay)

## # A tibble: 940 × 3
##            Id Date                StepTotal
##         <dbl> <dttm>                  <dbl>
##  1 1503960366 2016-04-12 00:00:00     13162
##  2 1503960366 2016-04-13 00:00:00     10735
##  3 1503960366 2016-04-14 00:00:00     10460
##  4 1503960366 2016-04-15 00:00:00      9762
##  5 1503960366 2016-04-16 00:00:00     12669
##  6 1503960366 2016-04-17 00:00:00      9705
##  7 1503960366 2016-04-18 00:00:00     13019
##  8 1503960366 2016-04-19 00:00:00     15506
##  9 1503960366 2016-04-20 00:00:00     10544
## 10 1503960366 2016-04-21 00:00:00      9819
## # … with 930 more rows

head(Sleep)

## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹TotalTimeInBed

sapply(Sleep, function(x) length(unique(x)))

##                 Id           SleepDay  TotalSleepRecords TotalMinutesAsleep 
##                 24                 31                  3                256 
##     TotalTimeInBed 
##                242

colnames(Sleep)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

clean_names(Sleep)

## # A tibble: 413 × 5
##            id sleep_day             total_sleep_records total_minutes_…¹ total…²
##         <dbl> <chr>                               <dbl>            <dbl>   <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                   1              327     346
##  2 1503960366 4/13/2016 12:00:00 AM                   2              384     407
##  3 1503960366 4/15/2016 12:00:00 AM                   1              412     442
##  4 1503960366 4/16/2016 12:00:00 AM                   2              340     367
##  5 1503960366 4/17/2016 12:00:00 AM                   1              700     712
##  6 1503960366 4/19/2016 12:00:00 AM                   1              304     320
##  7 1503960366 4/20/2016 12:00:00 AM                   1              360     377
##  8 1503960366 4/21/2016 12:00:00 AM                   1              325     364
##  9 1503960366 4/23/2016 12:00:00 AM                   1              361     384
## 10 1503960366 4/24/2016 12:00:00 AM                   1              430     449
## # … with 403 more rows, and abbreviated variable names ¹total_minutes_asleep,
## #   ²total_time_in_bed

Sleep %>%
  rename(Date=SleepDay)

## # A tibble: 413 × 5
##            Id Date                  TotalSleepRecords TotalMinutesAsleep Total…¹
##         <dbl> <chr>                             <dbl>              <dbl>   <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                 1                327     346
##  2 1503960366 4/13/2016 12:00:00 AM                 2                384     407
##  3 1503960366 4/15/2016 12:00:00 AM                 1                412     442
##  4 1503960366 4/16/2016 12:00:00 AM                 2                340     367
##  5 1503960366 4/17/2016 12:00:00 AM                 1                700     712
##  6 1503960366 4/19/2016 12:00:00 AM                 1                304     320
##  7 1503960366 4/20/2016 12:00:00 AM                 1                360     377
##  8 1503960366 4/21/2016 12:00:00 AM                 1                325     364
##  9 1503960366 4/23/2016 12:00:00 AM                 1                361     384
## 10 1503960366 4/24/2016 12:00:00 AM                 1                430     449
## # … with 403 more rows, and abbreviated variable name ¹TotalTimeInBed

Sleep$SleepDay=as.POSIXct(Sleep$SleepDay, format="%m/%d/%Y", tz=Sys.timezone())
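Since the same steps are repeated for every table, they could be wrapped in one small function. The following is only a sketch under assumptions: standardise_date() is a hypothetical helper, and it parses to the Date class with base as.Date() rather than to POSIXct as in the code above, which sidesteps the time zone argument. The demo rows are copied from the head(Sleep) output.

```r
# Hypothetical helper (not part of the original analysis): rename the date
# column to "Date" and parse it as a calendar date, returning a new data frame.
standardise_date <- function(df, date_col) {
  names(df)[names(df) == date_col] <- "Date"
  # as.Date() reads only as much of each string as the format needs,
  # so the trailing " 12:00:00 AM" in the sleep data is ignored
  df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
  df
}

sleep_demo <- data.frame(
  Id = c(1503960366, 1503960366),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384),
  stringsAsFactors = FALSE
)

sleep_clean <- standardise_date(sleep_demo, "SleepDay")
str(sleep_clean)
```

Applied to each table in turn, a helper like this would give all of them a Date column of the same name and type, which is what a later merge by user and day needs.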

First conclusions

From the cleaning I found that participation is not the same across the datasets: some participants are missing. The Activity, Calories and Steps datasets each have 33 unique ids, but Sleep has only 24. It is important to note that the sample is very small for a real data analysis, and entries are already missing, so in practice this would not be recommended as a good data source. Another important issue is that the data does not record gender. The data is also out of date, since it is from 2016.
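To see which participants are missing from a dataset, and not only how many, the id columns can be compared directly. This is a minimal sketch on made-up id vectors; activity_ids and sleep_ids are hypothetical stand-ins for Activity$Id and Sleep$Id.

```r
# Toy id vectors standing in for Activity$Id and Sleep$Id (values are illustrative)
activity_ids <- c(101, 101, 102, 103, 103, 104)
sleep_ids    <- c(101, 103, 103)

# Number of distinct participants in each table
n_activity <- length(unique(activity_ids))
n_sleep    <- length(unique(sleep_ids))

# Ids that have activity records but no sleep records
missing_from_sleep <- setdiff(unique(activity_ids), unique(sleep_ids))

n_activity
n_sleep
missing_from_sleep
```

On the real tables the same setdiff() call would list the participants who never logged sleep.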

Analysing the Data

First I take summary statistics of the most important activity columns:

Activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes,
         VeryActiveMinutes,
         Calories) %>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes VeryActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :  0.00   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:  0.00   
##  Median : 7406   Median : 5.245   Median :1057.5   Median :  4.00   
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   : 21.16   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.: 32.00   
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :210.00   
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900

The average is 7638 total steps and a distance of 5.49 per day. These numbers are below the commonly cited target of 10,000 steps a day.

Sedentary time averages 991.2 minutes per day; divided by 60, that is about 16.5 hours of inactivity. This suggests that the participants are not very active, or that they did not wear their smart device.
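The conversion can be checked with a couple of lines. This is a small sketch that hard-codes the mean from the summary() output above; the variable names are hypothetical.

```r
mean_sedentary_minutes <- 991.2   # mean of SedentaryMinutes from summary() above

# Minutes to hours
mean_sedentary_hours <- mean_sedentary_minutes / 60
round(mean_sedentary_hours, 1)    # 16.5

# Share of a 24-hour day (1440 minutes) spent sedentary, in percent
sedentary_share <- 100 * mean_sedentary_minutes / 1440
round(sedentary_share, 1)         # 68.8
```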

Very active time averaged 21.16 minutes per day. The average was 2304 calories per day; in Fitbit exports this column is the estimated energy burned, not intake.

Next I examine the sleep data:

Sleep %>%  
  select(TotalSleepRecords,
  TotalMinutesAsleep,
  TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

The average total time asleep is 419.5 minutes (almost 7 hours), which suggests that overall the wearers got enough sleep. We can see this more clearly by separating them into 3 categories:

sleepcategories <- Sleep %>%  
  group_by(Id) %>% 
  summarise(avg_time_asleep = mean(TotalMinutesAsleep)) %>% 
  mutate(type=case_when (
  avg_time_asleep < 300 ~ "need more sleep",
  avg_time_asleep >=300 & avg_time_asleep <= 420 ~ "average sleep",
  avg_time_asleep > 420 ~ "enough sleep"))
sleepcategories

## # A tibble: 24 × 3
##            Id avg_time_asleep type           
##         <dbl>           <dbl> <chr>          
##  1 1503960366            360. average sleep  
##  2 1644430081            294  need more sleep
##  3 1844505072            652  enough sleep   
##  4 1927972279            417  average sleep  
##  5 2026352035            506. enough sleep   
##  6 2320127002             61  need more sleep
##  7 2347167796            447. enough sleep   
##  8 3977333714            294. need more sleep
##  9 4020332650            349. average sleep  
## 10 4319703577            477. enough sleep   
## # … with 14 more rows

n_Average_sleepers <- sum(sleepcategories$type == 'average sleep')
n_Average_sleepers

## [1] 6

n_Need_more_sleep <- sum(sleepcategories$type == 'need more sleep')
n_Need_more_sleep

## [1] 6

n_Enough_sleep <- sum(sleepcategories$type == 'enough sleep')
n_Enough_sleep

## [1] 12

From the above we see that half of the participants (12 of 24) get enough sleep, a quarter (6) get an average amount, and only a quarter (6) need more sleep.
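The shares can be computed from the category labels instead of counting each one separately. The sketch below rebuilds a label vector from the counts reported above (6, 6 and 12 of 24 users); on the real data the same calls would take sleepcategories$type. The names sleep_type, counts and shares are hypothetical.

```r
# Category labels matching the counts reported above
sleep_type <- c(rep("average sleep", 6),
                rep("need more sleep", 6),
                rep("enough sleep", 12))

counts <- table(sleep_type)      # users per category
shares <- prop.table(counts)     # proportion of users per category

round(100 * shares, 1)           # percentages per category
```

Using table() and prop.table() keeps the three counts in one object, so the proportions always add up to 1 and a new category would be picked up automatically.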

Steps %>%  
  select(StepTotal) %>%
  summary()

##    StepTotal    
##  Min.   :    0  
##  1st Qu.: 3790  
##  Median : 7406  
##  Mean   : 7638  
##  3rd Qu.:10727  
##  Max.   :36019

The Steps dataset shows the same step statistics as the Activity dataset and confirms that the participants in the sample are not great walkers.

Calories %>%  
  select(Calories,
  ActivityDay) %>%
  summary()

##     Calories     ActivityDay                    
##  Min.   :   0   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:1828   1st Qu.:2016-04-19 00:00:00.00  
##  Median :2134   Median :2016-04-26 00:00:00.00  
##  Mean   :2304   Mean   :2016-04-26 06:53:37.01  
##  3rd Qu.:2793   3rd Qu.:2016-05-04 00:00:00.00  
##  Max.   :4900   Max.   :2016-05-12 00:00:00.00

The Calories dataset shows an average of 2304 calories per day, which falls between the NHS daily reference figures of about 2,000 kcal for women and 2,500 kcal for men.

Then I merge everything into one bigger dataset that includes all the categories we need to find more trends.
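A note on the join key: when two daily tables are merged on the user id alone, every row of a user in the first table is paired with every row of that user in the second, so rows multiply. Joining on both the id and the date gives one row per user and day. The sketch below shows the difference on toy tables; activity_demo and steps_demo are hypothetical, and the values for user 2 are made up.

```r
# Toy daily tables: user 1 has two days, user 2 has one day
activity_demo <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  Calories = c(1985, 1797, 2100)
)
steps_demo <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  StepTotal = c(13162, 10735, 8000)
)

# Joining on Id alone: user 1 contributes 2 x 2 rows, user 2 contributes 1
by_id <- merge(activity_demo, steps_demo, by = "Id")
nrow(by_id)       # 5

# Joining on Id and date: one row per user-day
by_id_date <- merge(activity_demo, steps_demo,
                    by.x = c("Id", "ActivityDate"),
                    by.y = c("Id", "ActivityDay"))
nrow(by_id_date)  # 3
```

The by.x and by.y arguments are only needed because the date column has a different name in each table; once both are called Date, by = c("Id", "Date") would be enough.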

merged_data <- merge( Activity, Steps, by = c('Id'))
head(merged_data)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   2016-04-12      13162           8.5             8.5
## 2 1503960366   2016-04-12      13162           8.5             8.5
## 3 1503960366   2016-04-12      13162           8.5             8.5
## 4 1503960366   2016-04-12      13162           8.5             8.5
## 5 1503960366   2016-04-12      13162           8.5             8.5
## 6 1503960366   2016-04-12      13162           8.5             8.5
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.88                     0.55
## 3                        0               1.88                     0.55
## 4                        0               1.88                     0.55
## 5                        0               1.88                     0.55
## 6                        0               1.88                     0.55
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                6.06                       0                25
## 3                6.06                       0                25
## 4                6.06                       0                25
## 5                6.06                       0                25
## 6                6.06                       0                25
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  13                  328              728     1985
## 3                  13                  328              728     1985
## 4                  13                  328              728     1985
## 5                  13                  328              728     1985
## 6                  13                  328              728     1985
##   ActivityDay StepTotal
## 1  2016-04-12     13162
## 2  2016-04-13     10735
## 3  2016-04-14     10460
## 4  2016-04-15      9762
## 5  2016-04-16     12669
## 6  2016-04-17      9705

merged_data_all<-merge(merged_data, Sleep, by=c('Id'))
head(merged_data_all)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   2016-05-09      12022          7.72            7.72
## 2 1503960366   2016-05-09      12022          7.72            7.72
## 3 1503960366   2016-05-09      12022          7.72            7.72
## 4 1503960366   2016-05-09      12022          7.72            7.72
## 5 1503960366   2016-05-09      12022          7.72            7.72
## 6 1503960366   2016-05-09      12022          7.72            7.72
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               3.45                     0.53
## 2                        0               3.45                     0.53
## 3                        0               3.45                     0.53
## 4                        0               3.45                     0.53
## 5                        0               3.45                     0.53
## 6                        0               3.45                     0.53
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                3.74                       0                46
## 2                3.74                       0                46
## 3                3.74                       0                46
## 4                3.74                       0                46
## 5                3.74                       0                46
## 6                3.74                       0                46
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  11                  206              835     1819
## 2                  11                  206              835     1819
## 3                  11                  206              835     1819
## 4                  11                  206              835     1819
## 5                  11                  206              835     1819
## 6                  11                  206              835     1819
##   ActivityDay StepTotal   SleepDay TotalSleepRecords TotalMinutesAsleep
## 1  2016-04-27     18134 2016-04-12                 1                327
## 2  2016-04-27     18134 2016-04-13                 2                384
## 3  2016-04-27     18134 2016-04-15                 1                412
## 4  2016-04-27     18134 2016-04-16                 2                340
## 5  2016-04-27     18134 2016-04-17                 1                700
## 6  2016-04-27     18134 2016-04-19                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

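The dataset merged_data_all now contains all the columns I want. Because both merges join on Id alone, every activity row is paired with every steps and sleep row of the same user, so the same values repeat across rows (as the head() output shows) and the dates in ActivityDate, ActivityDay and SleepDay do not line up within a row.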
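If one row per user per day is wanted instead, the merge can match on the date as well as the Id. A sketch with made-up stand-ins for `Activity` and `Sleep` (the values are illustrative, and it assumes both date columns have the same class, which may need a conversion on the real data):

```r
# Illustrative stand-ins for the real tables (values are made up)
Activity <- data.frame(
  Id           = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  Calories     = c(1985, 1797, 2100)
)
Sleep <- data.frame(
  Id                 = c(1, 1, 2),
  SleepDay           = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 384, 400)
)

# Merging on Id alone pairs every activity day with every sleep day
by_id <- merge(Activity, Sleep, by = "Id")

# Merging on Id and date keeps one row per user per day
by_id_date <- merge(
  Activity, Sleep,
  by.x = c("Id", "ActivityDate"),
  by.y = c("Id", "SleepDay")
)

nrow(by_id)       # 5 rows: 2 x 2 for user 1 plus 1 x 1 for user 2
nrow(by_id_date)  # 3 rows: one per user per day
```

With `by.x` and `by.y`, the key columns may have different names in the two tables; the result keeps the names from the first table.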

Visual Trends

By comparing some data from the merged dataset we can see that:

caloriesxsteps <- ggplot(data = Activity, aes(x = TotalSteps, y = Calories)) +
  geom_point(color = "pink") +
  geom_smooth() +
  labs(title = "Calories Burned vs Total Daily Steps", x = "Total Steps", y = "Calories Burned")
caloriesxsteps

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


There is a positive relationship between steps and calories: the more steps a user takes, the more calories they burn.
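The strength of this relationship can be checked numerically as well as visually. A minimal sketch using a made-up `Activity` sample with the same two columns (illustrative values only, so the numbers it prints say nothing about the real data):

```r
# Illustrative sample with the same columns as Activity (values are made up)
Activity <- data.frame(
  TotalSteps = c(0, 3790, 7406, 10727, 13162),
  Calories   = c(1500, 1828, 2134, 2500, 2793)
)

# Pearson correlation between steps and calories (ranges from -1 to 1)
steps_calories_cor <- cor(Activity$TotalSteps, Activity$Calories)

# Simple linear fit: the slope is the extra calories per additional step
fit <- lm(Calories ~ TotalSteps, data = Activity)
calories_per_step <- unname(coef(fit)["TotalSteps"])

steps_calories_cor
calories_per_step
```

A correlation close to 1 would support the reading of the plot, while a value near 0 would mean the upward trend is weak.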

activityxsleep <- ggplot(data = merged_data_all, aes(x = VeryActiveMinutes, y = TotalMinutesAsleep)) +
  geom_smooth() +
  labs(title = "Activity vs Sleep", x = "Activity", y = "sleep")
activityxsleep

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The more active a day somebody has, the more sleep they need.

Sleepxcalories <- ggplot(data = merged_data_all, aes(x = TotalMinutesAsleep, y = Calories)) +
  geom_smooth() +
  labs(title = "Sleep and Calories", x = "Sleep", y = "Burning Calories")
Sleepxcalories

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

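According to the plot, more sleep goes together with more calories burned. This is an association, not proof that sleep itself burns the calories.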

Recommendations based on the analysis

For the Bellabeat app, I would suggest the following:

+ Based on the analysis, users should be more active. Either they do not walk much or they do not wear their smart device to record their activity. The Bellabeat app should therefore send reminders to move. Set a minimum daily goal of about 30 minutes of activity, or a step target (10,000 steps). When users reach the goal, they earn a badge or a similar reward, which encourages them to keep it up every day.
+ Sleep notifications, combined with stress-reducing exercises on the mobile phone, for example breathing exercises or some form of meditation.
+ Users do track their activity, but the data has missing records and long periods of inactivity. The app could record data automatically through the user's other smart devices instead of depending on manual entries.

Answers to the questions:

1. What are some trends in smart device usage?
People use smart devices to record their everyday activity, but they don’t necessarily record it every day.
The sample suggests that smart devices need to notify users so they stay on track.
People who track their records are more likely to keep up their healthy habits.
2. How could these trends apply to Bellabeat customers? Tracking everyday habits helps people improve their way of living. Sleeping better, exercising more and watching their progress gives people a strong incentive to keep working on their well-being.
3. How could these trends help influence Bellabeat marketing strategy? Bellabeat should continue to produce high-quality smart devices. It should also work to make its app an everyday necessity for the user, by notifying and informing users about their progress and even rewarding them when they are doing great!

BellaBeat_v1

Akindynos Koutsouras

2022-12-04