Introduction

This document is a response to Case Study 2 of the Google Data Analytics Certificate. The case study examines how consumers use smart devices, such as smartphones and smart watches, and whether that usage suggests it would be good business for Bellabeat to invest in products and services that depend on smart devices.

The case study is organized in accordance with the six steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

Ask: The Business Task Statement

To analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices and how that could influence the marketing strategy for Bellabeat’s products (App, Leaf, and Time).

Prepare: A Description of all Data Sources Used

Fitbit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users.

Comments on the dataset: The Fitbit Fitness Tracker Data is limited. It does not include demographic information such as the gender and age group of the people who contributed to the data.

Process: Documentation of any cleaning or manipulation of data

The data cleaning was done in RStudio. The steps are documented below.

Loading the necessary packages

Note: Here, the ‘tidyverse’ library is loaded

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Import the data from the .csv file

Here, the .csv files are imported and stored in data frames

dailyActivity_merged_df <- read_csv("dailyActivity_merged.csv")

We’ll create another dataframe for the sleep data.

sleep_day_df <- read.csv("sleepDay_merged.csv")

Exploring a few key tables

Take a look at the dailyActivity_merged_df data.

head(dailyActivity_merged_df)
## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹​ActivityDate, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​VeryActiveDistance,
## #   ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance, ⁹​SedentaryActiveDistance

Identify all the columns in the dailyActivity_merged_df data.

colnames(dailyActivity_merged_df)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

Take a look at the sleep_day_df data.

head(sleep_day_df)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

Identify all the columns in the sleep_day_df data.

colnames(sleep_day_df)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

Note that both datasets have the ‘Id’ field - this can be used to merge the datasets.

Understanding some summary statistics

How many unique participants are there in each dataframe? It looks like there may be more participants in the daily activity dataset than the sleep dataset.

n_distinct(dailyActivity_merged_df$Id)
## [1] 33
n_distinct(sleep_day_df$Id)
## [1] 24
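
To see which participants have activity records but no sleep records, the two Id columns can be compared directly. The sketch below is only an illustration: `activity_toy` and `sleep_toy` are hypothetical stand-ins with made-up values, not the Fitbit files.

```r
# Toy stand-ins for the two data frames; Ids and values are made up.
activity_toy <- data.frame(
  Id = c(1, 1, 2, 3, 3),
  TotalSteps = c(5000, 7000, 3000, 9000, 11000)
)
sleep_toy <- data.frame(
  Id = c(1, 3),
  TotalMinutesAsleep = c(400, 380)
)

# Ids that appear in the activity data but not in the sleep data
missing_from_sleep <- setdiff(unique(activity_toy$Id), unique(sleep_toy$Id))
print(missing_from_sleep)
```

Applied to the real data frames, the same `setdiff()` call on `dailyActivity_merged_df$Id` and `sleep_day_df$Id` would list the participants that drop out of any analysis that needs sleep data.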

How many observations are there in each dataframe?

nrow(dailyActivity_merged_df)
## [1] 940
nrow(sleep_day_df)
## [1] 413
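
Since this section is about cleaning, it may be worth checking the row counts for exact duplicate rows and missing values before going further. The following is a minimal sketch on a made-up data frame (`sleep_toy` is hypothetical and deliberately contains one duplicate row); it is not a record of what the real files contain.

```r
# Toy sleep data with one exact duplicate row (values are made up)
sleep_toy <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(400, 380, 380, 420)
)

n_duplicates <- sum(duplicated(sleep_toy))  # rows identical to an earlier row
n_missing <- sum(is.na(sleep_toy))          # NA cells across all columns
sleep_clean <- unique(sleep_toy)            # keep the first copy of each row

print(c(duplicates = n_duplicates, missing = n_missing, rows_kept = nrow(sleep_clean)))
```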

Next, some quick summary statistics we’d want to know about each data frame.

For the daily activity dataframe:

dailyActivity_merged_df %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

For the sleep dataframe:

sleep_day_df %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

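What does this tell us about this sample of people’s activities? Most of the time spent in bed is spent asleep (a mean of 419.5 of 458.6 minutes). Mean time in bed is a little under half of mean sedentary minutes (991.2), so time in bed appears to account for a large part, but not most, of the sedentary time. Whether the tracker counts sleep as sedentary time is not stated in the data, and the two data frames cover different numbers of participants.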
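
As a rough check on this reading, the means reported by the two `summary()` calls can be compared directly. The sketch below only redoes the arithmetic on those three means; the variable names are introduced here for illustration. A ratio of means is not the same as the mean of per-night ratios, so the figures are indicative only.

```r
# Means copied from the summary() output above
mean_minutes_asleep <- 419.5
mean_time_in_bed <- 458.6
mean_sedentary <- 991.2

# Share of time in bed that is spent asleep
asleep_share_of_bed <- mean_minutes_asleep / mean_time_in_bed

# Time in bed relative to sedentary minutes
bed_share_of_sedentary <- mean_time_in_bed / mean_sedentary

print(round(asleep_share_of_bed, 2))
print(round(bed_share_of_sedentary, 2))
```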

Plotting a few explorations

For the daily activity dataframe:

ggplot(data=dailyActivity_merged_df, aes(x=TotalSteps, y=SedentaryMinutes)) +
  geom_smooth() +
  geom_point(mapping = aes(color = Calories))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What’s the relationship between steps taken in a day and sedentary minutes? How could this help inform the customer segments that we can market to? E.g. position this more as a way to get started in walking more? Or to measure steps that you’re already taking? There seems to be a general tendency that the more steps taken, the fewer the sedentary minutes (negative correlation), and the more steps taken, the more calories burned (positive correlation). This means that the device can be positioned to customers as a way to walk more and burn more calories.

For the sleep dataframe:

ggplot(data=sleep_day_df, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
  geom_smooth() +
  geom_point()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What’s the relationship between minutes asleep and time in bed? You might expect it to be almost completely linear - are there any unexpected trends? There is a strong linear relationship: the more time spent in bed, the more minutes asleep.

What could these trends tell you about how to help market this product? Or areas where you might want to explore further?

Summary statistics to support inferences

Checking the correlation between variables to confirm the relationships suggested above.

For the daily activity dataframe:

dailyActivity_merged_df %>% 
  summarise(cor(TotalSteps, SedentaryMinutes), cor(TotalSteps, Calories))
## # A tibble: 1 × 2
##   `cor(TotalSteps, SedentaryMinutes)` `cor(TotalSteps, Calories)`
##                                 <dbl>                       <dbl>
## 1                              -0.327                       0.592

For the sleep dataframe:

sleep_day_df %>% 
  summarise(cor(TotalMinutesAsleep, TotalTimeInBed))
##   cor(TotalMinutesAsleep, TotalTimeInBed)
## 1                               0.9304575
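
A correlation coefficient on its own does not say how precisely it is estimated. One option, sketched below on made-up vectors (`steps_toy` and `calories_toy` are hypothetical), is base R's `cor.test()`, which reports the coefficient together with a confidence interval and p-value. On the real data the daily records of one participant are not independent of each other, so any interval obtained this way should be read with caution.

```r
# Made-up example vectors, for illustration only
steps_toy <- c(2000, 4000, 6000, 8000, 10000, 12000)
calories_toy <- c(1500, 1700, 1650, 2100, 2300, 2250)

# Pearson correlation test (the default method of cor.test)
result <- cor.test(steps_toy, calories_toy)

print(result$estimate)  # Pearson's r, the same value cor() returns
print(result$conf.int)  # 95% confidence interval for r
print(result$p.value)   # p-value for the null hypothesis r = 0
```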

Merging these two datasets together

Note: You could set “all = TRUE” to keep all the Ids intact

combined_data <- merge(sleep_day_df, dailyActivity_merged_df, by="Id", all = FALSE)

Take a look at how many participants are in this data set.

n_distinct(combined_data$Id)
## [1] 24
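
Because `by = "Id"` matches every sleep row of a participant with every activity row of the same participant, the merged table holds one row per pair of days, not one row per day. A possible alternative, sketched below on made-up toy data, is to parse both date columns and merge on `Id` and the date together. The `Date` column and the toy values are assumptions for illustration; the date formats are the ones visible in the `head()` output earlier.

```r
# Toy stand-ins using the date formats shown in the head() output
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 5000)
)
sleep_toy <- data.frame(
  Id = c(1, 1),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384)
)

# Parse both columns to Date; as.Date() ignores the trailing time text
activity_toy$Date <- as.Date(activity_toy$ActivityDate, format = "%m/%d/%Y")
sleep_toy$Date <- as.Date(sleep_toy$SleepDay, format = "%m/%d/%Y")

# Merging on Id alone pairs every sleep day with every activity day
by_id <- merge(sleep_toy, activity_toy, by = "Id")

# Merging on Id and Date keeps one row per participant-day
by_id_date <- merge(sleep_toy, activity_toy, by = c("Id", "Date"))

print(nrow(by_id))
print(nrow(by_id_date))
```

With the toy data, the first merge returns four rows for participant 1 (two sleep days times two activity days), while the second returns two. Correlations computed on a table merged by Id alone compare values from different days, which is worth keeping in mind when reading the results that follow.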

Further explorations

colnames(combined_data)
##  [1] "Id"                       "SleepDay"                
##  [3] "TotalSleepRecords"        "TotalMinutesAsleep"      
##  [5] "TotalTimeInBed"           "ActivityDate"            
##  [7] "TotalSteps"               "TotalDistance"           
##  [9] "TrackerDistance"          "LoggedActivitiesDistance"
## [11] "VeryActiveDistance"       "ModeratelyActiveDistance"
## [13] "LightActiveDistance"      "SedentaryActiveDistance" 
## [15] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [17] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [19] "Calories"

Do participants who sleep more also take more steps or fewer steps per day?

ggplot(data = combined_data, aes(x = TotalMinutesAsleep, y = TotalSteps)) + 
  geom_smooth() +
  geom_point(mapping = aes(color = TotalDistance))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Check if there is a numeric correlation.

combined_data %>% 
  summarise(cor(TotalMinutesAsleep, TotalSteps))
##   cor(TotalMinutesAsleep, TotalSteps)
## 1                         -0.09854146

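The correlation between time spent asleep and number of steps taken is very weak (about -0.10), so there is little evidence of a linear relationship in this sample. Taking more steps is not associated with noticeably less sleep, which is consistent with, though it does not prove, the product not disrupting users’ regular sleep patterns.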
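
The question above is phrased at the level of participants, so one way to approach it is to average each participant’s values first and then compare participants. The following is a minimal sketch with dplyr on a made-up table; `daily_toy`, `per_participant`, and all values are hypothetical.

```r
library(dplyr)

# Toy participant-day table (made-up values)
daily_toy <- tibble(
  Id = c(1, 1, 2, 2, 3, 3),
  TotalMinutesAsleep = c(400, 420, 300, 320, 480, 500),
  TotalSteps = c(8000, 9000, 12000, 11000, 4000, 5000)
)

# One row per participant with their average sleep and steps
per_participant <- daily_toy %>%
  group_by(Id) %>%
  summarise(
    mean_minutes_asleep = mean(TotalMinutesAsleep),
    mean_steps = mean(TotalSteps)
  )

print(per_participant)

# Correlation across participants rather than across individual rows
print(cor(per_participant$mean_minutes_asleep, per_participant$mean_steps))
```

With only 24 participants in the sleep data, a participant-level correlation would rest on a small sample, so it would complement the row-level figure without replacing it.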

Analyze: A Summary of the Analysis

The Fitabase Data shows that:

Share: Supporting Visualizations of Key Findings

ggplot(data=dailyActivity_merged_df, aes(x=TotalSteps, y=SedentaryMinutes)) + 
  geom_smooth() + 
  geom_point(mapping = aes(color = Calories)) +
  labs(title = "Fitabase Data: Sedentary Minutes vs. Total Steps", subtitle = "Sample of the Amount of Calories Burned", 
       caption = "Data collected by Fitbit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius)") 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=sleep_day_df, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
  geom_smooth() +
  geom_point(mapping = aes(color = TotalSleepRecords)) +
  labs(title = "Fitabase Data: Total Time In Bed vs. Total Minutes Asleep", subtitle = "Sample of the Total Sleep Records", 
       caption = "Data collected by Fitbit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = combined_data, aes(x = TotalMinutesAsleep, y = TotalSteps)) + 
  geom_smooth() +
  geom_point(mapping = aes(color = TotalDistance)) +
  labs(title = "Fitabase Data: Total Steps vs. Total Minutes Asleep", subtitle = "Sample of the Total Distance", 
       caption = "Data collected by Fitbit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Act: High-level Insights Based on Analysis

Marketing strategy based on the analysis: