Summary

Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. Bellabeat products are available through a growing number of online retailers in addition to the company’s own e-commerce channel on its website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.

Products

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

1. Ask Phase

1.1 Business Task

Analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices, then apply those insights to Bellabeat’s marketing strategy.

1.2 Stakeholders
  • Urška Sršen - Bellabeat’s co-founder and Chief Creative Officer

  • Sando Mur - Bellabeat’s co-founder and key member of the Bellabeat executive team

  • Bellabeat marketing analytics team

2. Prepare Phase

3. Process Phase

Excel was used to format dates and remove duplicates from the sleep_day data.
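
The same two clean-up steps could also be done in R, which keeps the whole workflow reproducible. The following is a minimal sketch on a made-up stand-in table (`sleep_raw` and its values are illustrative assumptions, not the real file):

```r
# Toy stand-in for the sleep data; the values are illustrative only.
sleep_raw <- data.frame(
  Id = c(1, 1, 1),
  SleepDay = c("4/12/2016 0:00:00", "4/12/2016 0:00:00", "4/13/2016 0:00:00"),
  TotalMinutesAsleep = c(327, 327, 384)
)

# Drop rows that are exact duplicates of an earlier row.
sleep_clean <- sleep_raw[!duplicated(sleep_raw), ]

# Parse the text timestamp into a Date; the trailing "0:00:00" is ignored
# because parsing stops once the format has been matched.
sleep_clean$SleepDay <- as.Date(sleep_clean$SleepDay, format = "%m/%d/%Y")

str(sleep_clean)
```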

3.1 Installing and loading common packages and libraries

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggplot2)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(dplyr)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
install.packages("ggpubr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggpubr)

3.2 Import CSV files to R

daily_activity <- read.csv("dailyActivity_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")

3.2.1 Exploring Dataset table

Take a look at daily_activity

head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

Identify the daily_activity columns

colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

Take a look at sleep_day

head(sleep_day)
##           Id          SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 0:00:00                 1                327
## 2 1503960366 4/13/2016 0:00:00                 2                384
## 3 1503960366 4/15/2016 0:00:00                 1                412
## 4 1503960366 4/16/2016 0:00:00                 2                340
## 5 1503960366 4/17/2016 0:00:00                 1                700
## 6 1503960366 4/19/2016 0:00:00                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

Identify the sleep_day columns

colnames(sleep_day)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

3.2.2 Understanding some summary statistics

How many unique participants are in each data frame?

n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24

What is the number of observations in each data frame?

nrow(daily_activity)
## [1] 940
nrow(sleep_day)
## [1] 410

Observation

  • The daily activity data set contains 33 participants, while the sleep data set contains only 24, according to the output above.

  • There are more participants in the daily activity data (33) than the number of participants in the study (30).
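
To see which participants have activity records but no sleep records, the two sets of ids can be compared directly. This is a small sketch with made-up ids; with the real data the inputs would presumably be `unique(daily_activity$Id)` and `unique(sleep_day$Id)`:

```r
# Hypothetical participant ids, for illustration only.
activity_ids <- c(101, 102, 103, 104)
sleep_ids <- c(101, 103)

# Ids present in the activity data but absent from the sleep data.
missing_sleep <- setdiff(activity_ids, sleep_ids)
print(missing_sleep)
```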

3.3 Column Cleaning

Before merging, the column names are converted to lower case so that they are consistent across both data frames.

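# Lower-case every column name; later code relies on names such as `totalsteps`.
# janitor::clean_names() is not applied: it would yield snake_case names
# (e.g. `total_steps`), and calling it without assignment changes nothing.
daily_activity <- rename_with(daily_activity, tolower)
sleep_day <- rename_with(sleep_day, tolower)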

4. Analyze and Share Phase

What summary statistics would we want to know about each data frame?

daily_activity %>%  
  select(totalsteps,
         totaldistance,
         sedentaryminutes,
         calories) %>%
  summary()
##    totalsteps    totaldistance    sedentaryminutes    calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900
sleep_day %>%  
  select(totalsleeprecords,
         totalminutesasleep,
         totaltimeinbed) %>%
  summary()
##  totalsleeprecords totalminutesasleep totaltimeinbed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0
daily_activity %>%
  select(veryactiveminutes,
         fairlyactiveminutes,
         lightlyactiveminutes) %>%
           summary()
##  veryactiveminutes fairlyactiveminutes lightlyactiveminutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0       
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0       
##  Median :  4.00    Median :  6.00      Median :199.0       
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8       
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0       
##  Max.   :210.00    Max.   :143.00      Max.   :518.0

Key Takeaway

4.1 Plotting a few explorations

What’s the relationship between steps taken in a day and sedentary minutes?

ggplot(data = daily_activity, mapping = aes(x=totalsteps,y=sedentaryminutes))+geom_point(color = "#0072B2")+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What’s the relationship between minutes asleep and time in bed?

ggplot(data = sleep_day,mapping = aes(x=totalminutesasleep,y=totaltimeinbed))+geom_point(color = "#E69F00")+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What’s the relationship between Total steps and calories?

ggplot(data = daily_activity, mapping = aes(x=totalsteps,y=calories))+geom_point(color= "#CC79A7")+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

4.2 Merge both data sets for further exploration

We merge daily_activity and sleep_day, using id as the join key, to look for correlations between variables across the two data sets.

combine_data <- merge(sleep_day,daily_activity, by="id")

4.2.1 Confirm all data merged

n_distinct(combine_data$id)
## [1] 24

The daily activity data set contains participant ids that have no sleep records, and the merge above dropped them. A full outer join (all = TRUE) keeps those participants in the merged data set.

4.2.2 Use an outer join to keep all ids

combine_data <- merge(sleep_day, daily_activity, by="id", all = TRUE)
n_distinct(combine_data$id)
## [1] 33
glimpse(combine_data)
## Rows: 12,575
## Columns: 19
## $ id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ sleepday                 <chr> "4/12/2016 0:00:00", "4/12/2016 0:00:00", "4/…
## $ totalsleeprecords        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ totalminutesasleep       <int> 327, 327, 327, 327, 327, 327, 327, 327, 327, …
## $ totaltimeinbed           <int> 346, 346, 346, 346, 346, 346, 346, 346, 346, …
## $ activitydate             <chr> "5/7/2016", "5/6/2016", "5/1/2016", "4/30/201…
## $ totalsteps               <int> 11992, 12159, 10602, 14673, 13162, 10735, 153…
## $ totaldistance            <dbl> 7.71, 8.03, 6.81, 9.25, 8.50, 6.97, 9.80, 8.9…
## $ trackerdistance          <dbl> 7.71, 8.03, 6.81, 9.25, 8.50, 6.97, 9.80, 8.9…
## $ loggedactivitiesdistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactivedistance       <dbl> 2.46, 1.97, 2.29, 3.56, 1.88, 1.57, 5.29, 2.9…
## $ moderatelyactivedistance <dbl> 2.12, 0.25, 1.60, 1.42, 0.55, 0.69, 0.57, 1.0…
## $ lightactivedistance      <dbl> 3.13, 5.81, 2.92, 4.27, 6.06, 4.71, 3.94, 4.8…
## $ sedentaryactivedistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactiveminutes        <int> 37, 24, 33, 52, 25, 21, 73, 45, 48, 16, 31, 7…
## $ fairlyactiveminutes      <int> 46, 6, 35, 34, 13, 19, 14, 24, 28, 12, 23, 11…
## $ lightlyactiveminutes     <int> 175, 289, 246, 217, 328, 217, 216, 250, 189, …
## $ sedentaryminutes         <int> 833, 754, 730, 712, 728, 776, 814, 857, 782, …
## $ calories                 <int> 1821, 1896, 1820, 1947, 1985, 1797, 2013, 195…
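
The row count above (12,575) is larger than either input because joining on id alone pairs every sleep row of a participant with every activity row of that participant. This does not change the per-participant averages, but each merged row may combine sleep and activity from different days. One possible alternative, sketched here on made-up toy tables (the `date` column is an assumption introduced for illustration), is to join on both id and date:

```r
# Toy data: two days for one participant in each table (illustrative values).
sleep_toy <- data.frame(
  id = c(1, 1),
  sleepday = c("4/12/2016 0:00:00", "4/13/2016 0:00:00"),
  totalminutesasleep = c(327, 384)
)
activity_toy <- data.frame(
  id = c(1, 1),
  activitydate = c("4/12/2016", "4/13/2016"),
  totalsteps = c(13162, 10735)
)

# Joining on id alone pairs every sleep row with every activity row.
by_id <- merge(sleep_toy, activity_toy, by = "id")
print(nrow(by_id))

# Parse both date columns to Date, then join on id and date so that each
# row describes one participant on one day.
sleep_toy$date <- as.Date(sleep_toy$sleepday, format = "%m/%d/%Y")
activity_toy$date <- as.Date(activity_toy$activitydate, format = "%m/%d/%Y")
by_id_date <- merge(sleep_toy, activity_toy, by = c("id", "date"), all = TRUE)
print(nrow(by_id_date))
```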

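4.3 Based on the CDC’s guidelines, categorize users by activity level.

  • Sedentary - Fewer than 5,000 steps a day.
  • Lightly active - Between 5,000 and 7,499 steps a day.
  • Fairly active - Between 7,500 and 9,999 steps a day.
  • Very active - 10,000 or more steps a day.

Note that the code below uses somewhat different cut-offs (5,000, 8,000, and 12,000 steps) and labels; those are the ones reflected in the results.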
ActivityLevel <- combine_data %>%    
  group_by(id) %>%    
  summarise(avg_daily_step= mean(totalsteps)) %>%    
  mutate (ActivityLevel=case_when (     
    avg_daily_step <5000 ~ "Sedentary",     
    avg_daily_step >=5000 & avg_daily_step< 8000 ~"Fairly Active",     
    avg_daily_step>=8000 & avg_daily_step <12000~"Active",     
    avg_daily_step >=12000 ~ 'Highly Active'))
head(ActivityLevel)
## # A tibble: 6 × 3
##           id avg_daily_step ActivityLevel
##        <dbl>          <dbl> <chr>        
## 1 1503960366         12117. Highly Active
## 2 1624580081          5744. Fairly Active
## 3 1644430081          7283. Fairly Active
## 4 1844505072          2580. Sedentary    
## 5 1927972279           916. Sedentary    
## 6 2022484408         11371. Active
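
The step thresholds in the bulleted list above could also be applied with base R’s `cut()`. This is a minimal sketch on a made-up vector of average daily steps (the values are illustrative assumptions); `right = FALSE` makes each interval include its lower bound, so exactly 10,000 steps counts as "Very active":

```r
# Hypothetical average daily step counts, for illustration only.
avg_daily_step <- c(916, 5744, 7283, 9999, 10000, 12117)

# Bin the averages using the listed thresholds: 5,000 / 7,500 / 10,000.
level <- cut(
  avg_daily_step,
  breaks = c(0, 5000, 7500, 10000, Inf),
  labels = c("Sedentary", "Lightly active", "Fairly active", "Very active"),
  right = FALSE
)

print(data.frame(avg_daily_step, level))
```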

4.3.1 Based on the CDC’s guidelines, categorize users by hours of sleep.

Adults should get 7 or more hours of sleep every night, according to the American Academy of Sleep Medicine and the Sleep Research Society.

  • Unhealthy sleepers - Less than 300 minutes (5 hrs).
  • Average sleepers - Between 300 and 420 minutes (5 to 7 hrs).
  • Healthy sleepers - More than 420 minutes (7 hrs).
SleepHours <- combine_data %>%    
  group_by(id) %>%   
  summarise(avg_time_asleep = mean(totalminutesasleep)) %>%    
  mutate(type=case_when (   
    avg_time_asleep < 300 ~ "unhealthy sleep",   
    avg_time_asleep >=300 & avg_time_asleep <= 420 ~ "average sleep",   
    avg_time_asleep > 420 ~ "healthy sleep")) 
head(SleepHours)
## # A tibble: 6 × 3
##           id avg_time_asleep type           
##        <dbl>           <dbl> <chr>          
## 1 1503960366            360. average sleep  
## 2 1624580081             NA  <NA>           
## 3 1644430081            294  unhealthy sleep
## 4 1844505072            652  healthy sleep  
## 5 1927972279            417  average sleep  
## 6 2022484408             NA  <NA>
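
The NA rows above belong to participants who have activity records but no sleep records, so their average cannot be computed. If those participants should not be counted in the sleep categories, they can be dropped first. A minimal sketch on a made-up table (`sleep_hours_toy` is a hypothetical stand-in for SleepHours):

```r
# Toy stand-in for the per-participant sleep averages (illustrative values).
sleep_hours_toy <- data.frame(
  id = c(1, 2, 3),
  avg_time_asleep = c(360, NA, 294)
)

# Keep only participants that have an average to categorize.
with_sleep <- sleep_hours_toy[!is.na(sleep_hours_toy$avg_time_asleep), ]
print(with_sleep)
```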

4.4 Plotting for further exploration

Let’s take a look at the average activity level of each user.

ggplot(data = ActivityLevel, mapping = aes(x=ActivityLevel, fill = ActivityLevel))+
  geom_bar(width = 0.60)+ scale_fill_manual(values = c("#000000", "#E69F00","#56B4E9","#0072B2"))+labs(title="Activity Level")

Sleeping Hours

ggplot(data = SleepHours, mapping = aes(x= type, fill = type))+
  geom_bar()+ scale_fill_manual(values = c("#000000", "#E69F00","#56B4E9"))+labs(title="Sleep Hours")

4.5 Correlations among the data

p1 = ggplot(data = daily_activity, mapping = aes(x=totalsteps,y=sedentaryminutes))+geom_point(color = "#0072B2")+geom_smooth()+labs(title="Total Steps vs. Sedentary Minutes")

p2 = ggplot(data = sleep_day,mapping = aes(x=totalminutesasleep,y=totaltimeinbed))+geom_point(color = "#E69F00")+geom_smooth()+labs(title="Minutes Asleep vs. Time in Bed")

p3 = ggplot(data = daily_activity, mapping = aes(x=totalsteps,y=calories))+geom_point(color= "#CC79A7")+geom_smooth()+labs(title="Total Steps vs. Calories")

ggpubr::ggarrange(p1, p2, p3, ncol = 2, nrow = 2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Summary

  • The first graph shows a negative association between total steps and sedentary minutes: people who took more steps tended to spend fewer minutes sedentary.

  • The second graph is almost entirely linear: most users spend nearly as much time asleep as they do in bed. Only a few users spend substantially more time in bed than asleep.

  • The third graph shows a positive correlation between total steps and calories burnt: as expected, calories burnt tend to increase as the number of steps increases.
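
The visual impressions above can be complemented with a correlation coefficient. The sketch below uses only the six rows printed by head(daily_activity) earlier, all from a single participant, so the number it produces is not representative. With the full data, one would presumably pass `daily_activity$totalsteps` and `daily_activity$calories` instead:

```r
# The six rows shown in head(daily_activity).
totalsteps <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728)

# Pearson correlation between steps and calories; values near +1 indicate
# a strong positive linear relationship, values near -1 a strong negative one.
r <- cor(totalsteps, calories)
print(r)
```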

5. Act Phase (Conclusion)

Following our analysis, we discovered some trends that may aid marketing strategy and improve the Bellabeat app. Bellabeat can highlight the advantages of a healthy lifestyle by using the correlation between steps taken and calories burnt to urge users to be active and track their daily progress.

Thank you for taking the time to read about my first R project.