Introduction

Welcome to my case study. My analysis will be on Bellabeat, a high-tech manufacturer of health-focused products for women.

Scenario

Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and chief creative officer at Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices.

Stakeholders

Products

Business Task

Analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices.

Questions:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Deliverables:

  1. A clear summary of the business task
  2. A description of all data sources used
  3. Documentation of any cleaning or manipulation of data
  4. A summary of my analysis
  5. Supporting visualizations and key findings
  6. Top high-level content recommendations based on my analysis

Data Sources

Fitbit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius). This Kaggle dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The dataset contains a total of 18 CSV files with information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Limitations

  1. Outdated - The data was collected in 2016.
  2. Small sample size - A sample of roughly 30 participants (33 distinct Ids in the daily activity file and 24 in the sleep file) is too small to generalize from and risks sampling bias.
  3. Demographics - Bellabeat’s products are made for and used by women, but this data does not specify gender or age.
  4. Third party source - The data was collected through Amazon Mechanical Turk, a third party, rather than by Bellabeat, which makes it less reliable.
  5. Short time period - The data covers about one month, which is a short time period.

Given these limitations, I would recommend collecting our own data or using other sources.

Loading packages and preparing the data

I will use R for all phases of my analysis.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readr)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(dplyr)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(here)
## here() starts at /cloud/project
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(skimr)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggplot2)
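
The attach message above shows that tidyverse already loads ggplot2, dplyr and readr, so the setup could be shortened. The following is only a sketch of one way to do that: the helper names `missing_packages` and `setup_packages` are hypothetical and not part of any package.

```r
# Packages used in this analysis. tidyverse already attaches ggplot2, dplyr and
# readr, so they do not need their own install.packages() or library() calls.
pkgs <- c("tidyverse", "here", "janitor", "skimr", "lubridate")

# Hypothetical helper: which of the wanted packages are not installed yet?
missing_packages <- function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

# Hypothetical helper: install only what is missing, then attach everything.
setup_packages <- function(wanted) {
  to_install <- missing_packages(wanted)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(lapply(wanted, library, character.only = TRUE))
}
```

Calling `setup_packages(pkgs)` once at the top of the script would avoid reinstalling packages every time the analysis is run.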

Importing data

I will import two of the files: dailyActivity_merged and sleepDay_merged.

dailyActivity_merged <- read_csv("Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleepDay_merged <- read_csv("Fitabase Data 4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
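
The import messages above suggest specifying column types. As a sketch of what that could look like, the example below reads Id as text, since it is an identifier rather than a quantity, and silences the column message. It uses a small temporary file built from two rows of the head() output instead of the real CSV; the four-column layout and the assumption that these rows belong to Id 1503960366 are mine.

```r
library(readr)

# Two rows copied from head(dailyActivity_merged), trimmed to four columns.
# Assumption: the Id shown as 1.50e9 is 1503960366.
csv_text <- c(
  "Id,ActivityDate,TotalSteps,Calories",
  "1503960366,4/12/2016,13162,1985",
  "1503960366,4/13/2016,10735,1797"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

activity_sample <- read_csv(
  tmp,
  col_types = cols(
    Id = col_character(),   # keep the identifier as text, not a number
    .default = col_guess()  # let readr guess the remaining columns
  ),
  show_col_types = FALSE
)
str(activity_sample)
```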
Taking a closer look at imported data structure
str(dailyActivity_merged)
## spec_tbl_df [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(sleepDay_merged)
## spec_tbl_df [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(dailyActivity_merged)
## # A tibble: 6 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
head(sleepDay_merged)
## # A tibble: 6 × 5
##           Id SleepDay           TotalSleepRecor… TotalMinutesAsl… TotalTimeInBed
##        <dbl> <chr>                         <dbl>            <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:00:0…                1              327            346
## 2 1503960366 4/13/2016 12:00:0…                2              384            407
## 3 1503960366 4/15/2016 12:00:0…                1              412            442
## 4 1503960366 4/16/2016 12:00:0…                2              340            367
## 5 1503960366 4/17/2016 12:00:0…                1              700            712
## 6 1503960366 4/19/2016 12:00:0…                1              304            320
colnames(dailyActivity_merged)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(sleepDay_merged)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
How many unique participants are there in each dataframe?
n_distinct(dailyActivity_merged$Id)
## [1] 33
n_distinct(sleepDay_merged$Id)
## [1] 24
Check for duplicate rows
nrow(dailyActivity_merged)
## [1] 940
nrow(sleepDay_merged)
## [1] 413
nrow(unique(dailyActivity_merged))
## [1] 940
nrow(unique(sleepDay_merged))
## [1] 410
Removing duplicate rows from sleepDay_merged
sleepDay <- unique(sleepDay_merged)
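
The check above compares row counts before and after unique(). A roughly equivalent sketch with duplicated() and dplyr's distinct() is shown here. The first three rows come from head(sleepDay_merged); the fourth row is a deliberate copy of the third that I added for illustration and is not taken from the data.

```r
library(dplyr)

sleep_sample <- tibble(
  Id = c(1503960366, 1503960366, 1503960366, 1503960366),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/15/2016 12:00:00 AM", "4/15/2016 12:00:00 AM"),
  TotalSleepRecords = c(1, 2, 1, 1),
  TotalMinutesAsleep = c(327, 384, 412, 412),
  TotalTimeInBed = c(346, 407, 442, 442)
)

# Count rows that are exact copies of an earlier row
n_duplicates <- sum(duplicated(sleep_sample))

# Keep the first copy of each distinct row
sleep_clean <- distinct(sleep_sample)

n_duplicates
nrow(sleep_clean)
```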
Create a new data frame, daily_activity_1 and rename columns
daily_activity_1 <- dailyActivity_merged %>%
  select("Id","Date"= "ActivityDate","TotalSteps", "SedentaryMinutes", "VeryActiveMinutes","FairlyActiveMinutes", "LightlyActiveMinutes", "Calories")
view(daily_activity_1)
Formatting dates in daily_activity_1
daily_activity_1$date <- mdy(daily_activity_1$Date)
Create a new dataframe, sleep_1 and convert Total Minutes Asleep to Total Hours asleep
sleep_1 <- sleepDay %>%
  select("Id", "SleepDay", "TotalMinutesAsleep", "TotalTimeInBed")%>%
  filter(TotalMinutesAsleep !=0)
sleep_1$Total_hrs_asleep <- round(sleep_1$TotalMinutesAsleep/60)
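
Note that sleep_1 still stores SleepDay as text, and rounding to whole hours discards part of each night's sleep. A possible refinement, sketched on three rows from head(sleepDay_merged), parses the timestamp into a date and keeps one decimal place. The column names `date` and `hrs_asleep` are my own choices.

```r
library(dplyr)
library(lubridate)

sleep_sample <- tibble(
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/15/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384, 412)
)

sleep_sample <- sleep_sample %>%
  mutate(
    date = as_date(mdy_hms(SleepDay)),              # text timestamp -> Date
    hrs_asleep = round(TotalMinutesAsleep / 60, 1)  # hours, one decimal place
  )
sleep_sample
```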
Merge daily_activity_1 to sleep_1 to create merged_data dataframe
merged_data <- merge(daily_activity_1, sleep_1, by = "Id")
summary(merged_data)
##        Id                Date             TotalSteps    SedentaryMinutes
##  Min.   :1.504e+09   Length:12348       Min.   :    0   Min.   :   0.0  
##  1st Qu.:3.977e+09   Class :character   1st Qu.: 4660   1st Qu.: 659.0  
##  Median :4.703e+09   Mode  :character   Median : 8585   Median : 734.0  
##  Mean   :5.021e+09                      Mean   : 8108   Mean   : 799.4  
##  3rd Qu.:6.962e+09                      3rd Qu.:11317   3rd Qu.: 853.0  
##  Max.   :8.792e+09                      Max.   :22988   Max.   :1440.0  
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes    Calories   
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:144.0        1st Qu.:1776  
##  Median :  8.00    Median : 10.00      Median :200.0        Median :2158  
##  Mean   : 23.94    Mean   : 17.34      Mean   :199.8        Mean   :2323  
##  3rd Qu.: 36.00    3rd Qu.: 24.00      3rd Qu.:258.0        3rd Qu.:2859  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :4900  
##       date              SleepDay         TotalMinutesAsleep TotalTimeInBed 
##  Min.   :2016-04-12   Length:12348       Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:2016-04-19   Class :character   1st Qu.:361.0      1st Qu.:402.0  
##  Median :2016-04-27   Mode  :character   Median :432.0      Median :462.0  
##  Mean   :2016-04-26                      Mean   :419.1      Mean   :458.2  
##  3rd Qu.:2016-05-04                      3rd Qu.:492.0      3rd Qu.:526.0  
##  Max.   :2016-05-12                      Max.   :796.0      Max.   :961.0  
##  Total_hrs_asleep
##  Min.   : 1.00   
##  1st Qu.: 6.00   
##  Median : 7.00   
##  Mean   : 6.99   
##  3rd Qu.: 8.00   
##  Max.   :13.00
n_distinct(merged_data$Id)
## [1] 24
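
The summary above reports 12348 rows, far more than the 940 activity rows or the 410 sleep rows. Merging on Id alone pairs every activity day of a participant with every sleep day of the same participant. One way to keep a single row per participant and day is to join on both Id and date. The sketch below illustrates the difference on the first six rows of each table from the head() output, assuming both belong to participant 1503960366; it is not the merge used in the rest of this analysis.

```r
library(dplyr)
library(lubridate)

# First six rows of each table, copied from the head() output.
# Assumption: the activity rows belong to Id 1503960366.
activity_sample <- tibble(
  Id = 1503960366,
  Date = c("4/12/2016", "4/13/2016", "4/14/2016",
           "4/15/2016", "4/16/2016", "4/17/2016"),
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705)
) %>%
  mutate(date = mdy(Date))

sleep_sample <- tibble(
  Id = 1503960366,
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/15/2016 12:00:00 AM", "4/16/2016 12:00:00 AM",
               "4/17/2016 12:00:00 AM", "4/19/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304)
) %>%
  mutate(date = as_date(mdy_hms(SleepDay)))

# Merging on Id only: every activity day is paired with every sleep day
by_id_only <- merge(activity_sample, sleep_sample, by = "Id")

# Joining on Id and date: one row per participant and day present in both
by_id_and_date <- inner_join(activity_sample, sleep_sample, by = c("Id", "date"))

nrow(by_id_only)      # 36 rows (6 x 6)
nrow(by_id_and_date)  # 5 rows (the dates present in both samples)
```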
Remove rows with zero Total Steps or Calories
merged_data_2 <- merged_data %>%
  filter(TotalSteps != 0) %>%
  filter(Calories != 0)
view(merged_data_2)
Calculating the total minutes recorded at each activity level
VeryActiveMins <- sum(daily_activity_1$VeryActiveMinutes)
FairlyActiveMins <- sum(daily_activity_1$FairlyActiveMinutes)
LightlyActiveMins <- sum(daily_activity_1$LightlyActiveMinutes)
SedentaryMins <- sum(daily_activity_1$SedentaryMinutes)
TotalMinsActivity <- VeryActiveMins + FairlyActiveMins + LightlyActiveMins + SedentaryMins
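
The four sums can also be computed in one step and turned into percentages. This is a sketch only, run on the first five rows shown in the str() output rather than on the full data, so the percentages it produces describe those five days and nothing more. The names `level`, `minutes` and `pct` are my own.

```r
library(dplyr)
library(tidyr)

# First five rows of the four minute columns, from str(dailyActivity_merged)
activity_sample <- tibble(
  VeryActiveMinutes = c(25, 21, 30, 29, 36),
  FairlyActiveMinutes = c(13, 19, 11, 34, 10),
  LightlyActiveMinutes = c(328, 217, 181, 209, 221),
  SedentaryMinutes = c(728, 776, 1218, 726, 773)
)

activity_share <- activity_sample %>%
  summarise(across(everything(), sum)) %>%            # one total per level
  pivot_longer(everything(),
               names_to = "level", values_to = "minutes") %>%
  mutate(pct = round(100 * minutes / sum(minutes), 1))  # share of all minutes
activity_share
```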

Visualizing Data

ggplot(data = daily_activity_1)+
  geom_point(mapping = aes(x = SedentaryMinutes, y = TotalSteps), color = "red")+
  labs(title = "Total Steps vs. Sedentary Minutes")

ggplot(data = daily_activity_1)+
  geom_point(mapping = aes(x = LightlyActiveMinutes, y = TotalSteps), color = "dark green")+
  labs(title = "Total Steps vs. Lightly Active Minutes")

ggplot(data = daily_activity_1)+
  geom_point(mapping = aes(x = VeryActiveMinutes, y = TotalSteps), color = "orange")+
  labs(title = "Total Steps vs. Very Active Minutes")

The above graphs show the relationship between daily steps and active minutes. Most participants seem to be sedentary to lightly active.
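
To put numbers on these relationships, correlation coefficients can be computed alongside the scatter plots. The sketch below uses only the first five rows shown in the str() output, so its values are illustrative and say nothing about the full dataset. On the real data the same call would use the columns of daily_activity_1.

```r
# First five rows from str(dailyActivity_merged)
activity_sample <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762, 12669),
  SedentaryMinutes = c(728, 776, 1218, 726, 773),
  LightlyActiveMinutes = c(328, 217, 181, 209, 221),
  VeryActiveMinutes = c(25, 21, 30, 29, 36)
)

# Pearson correlation between daily steps and each minutes column
minute_cols <- c("SedentaryMinutes", "LightlyActiveMinutes", "VeryActiveMinutes")
step_cors <- sapply(
  activity_sample[minute_cols],
  function(minutes) cor(activity_sample$TotalSteps, minutes)
)
round(step_cors, 2)
```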

slices <- c(VeryActiveMins,FairlyActiveMins,LightlyActiveMins,SedentaryMins)
lbls <- c("VeryActive","FairlyActive","LightlyActive","Sedentary")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct)
lbls <- paste(lbls, "%", sep="")
pie(slices, labels = lbls, col = topo.colors(length(lbls)), main = "Percentage of Activity")

This pie chart shows the share of recorded minutes at each activity level over the month. Sedentary minutes make up the largest share.

ggplot(data = merged_data_2)+
  geom_point(mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed), color = "blue")+
  labs(title = "Time in Bed vs. Minutes Asleep")

This graph shows that, in general, participants were asleep for most of the time they spent in bed.
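
The same point can be expressed as a ratio of minutes asleep to minutes in bed. The sketch below computes it for the six rows shown in head(sleepDay_merged); the column name `sleep_efficiency` is my own label, not a field in the dataset.

```r
library(dplyr)

# Six rows from head(sleepDay_merged)
sleep_sample <- tibble(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed = c(346, 407, 442, 367, 712, 320)
) %>%
  mutate(sleep_efficiency = TotalMinutesAsleep / TotalTimeInBed)

# Share of time in bed spent asleep, per night
round(sleep_sample$sleep_efficiency, 2)
```

For these six nights every ratio is above 0.9. Whether that holds across all participants would need to be checked on the full sleep_1 data frame.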

ggplot(data = daily_activity_1)+
  geom_point(mapping = aes(x = TotalSteps, y = Calories), color = "purple")+
  geom_smooth(mapping = aes(x = TotalSteps, y = Calories))+
  labs(title = "Total Steps vs. Calories")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This graph shows the positive relationship between Total Steps and calories burned.
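
The smoothing line above uses loess, which shows the shape of the trend but gives no single number for it. If a simple estimate of the slope is wanted, a linear model is one option. The sketch below fits one on the first five rows from the str() output, so the coefficient is illustrative only. The name `calories_per_1000_steps` is my own.

```r
# First five rows of TotalSteps and Calories from str(dailyActivity_merged)
activity_sample <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762, 12669),
  Calories = c(1985, 1797, 1776, 1745, 1863)
)

# Straight-line fit of calories on steps
fit <- lm(Calories ~ TotalSteps, data = activity_sample)
slope <- coef(fit)[["TotalSteps"]]

# Estimated extra calories per additional 1000 steps in this small sample
calories_per_1000_steps <- slope * 1000
calories_per_1000_steps
```

In the plot itself, the equivalent change would be `geom_smooth(method = "lm")`.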

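ggplot(data = merged_data_2)+
  geom_bar(mapping = aes(x = Total_hrs_asleep, fill = TotalSteps))+
  labs(title="Total Steps vs. Sleep", x="Hours Sleep", y="Number of records")

This graph is intended to show how daily activity relates to hours slept. Because geom_bar() counts rows, the bar heights show how many records fall at each number of hours slept rather than step totals, so the chart on its own does not establish a positive relationship.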
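
To compare activity across sleep durations directly, one option is to average steps within each hours-slept group and plot the averages with geom_col(). The sketch below uses the five days on which the head() rows of both tables share a date, assuming they belong to the same participant, so the result is illustrative only. The names `matched`, `steps_by_sleep`, `mean_steps` and `n_days` are my own.

```r
library(dplyr)
library(ggplot2)

# Days present in both head() outputs: 4/12, 4/13, 4/15, 4/16 and 4/17
matched <- tibble(
  TotalSteps = c(13162, 10735, 9762, 12669, 9705),
  TotalMinutesAsleep = c(327, 384, 412, 340, 700)
) %>%
  mutate(Total_hrs_asleep = round(TotalMinutesAsleep / 60))

# Average steps for each whole number of hours slept
steps_by_sleep <- matched %>%
  group_by(Total_hrs_asleep) %>%
  summarise(mean_steps = mean(TotalSteps), n_days = n())

# Bar heights are now mean steps rather than row counts
sleep_plot <- ggplot(data = steps_by_sleep)+
  geom_col(mapping = aes(x = Total_hrs_asleep, y = mean_steps))+
  labs(title = "Mean Steps by Hours Slept", x = "Hours Sleep", y = "Mean Total Steps")
sleep_plot
```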

Key Findings

Recommendations