This document is a response to Case Study 2 of the Google Data Analytics Certificate. The case study asks whether there is positive interest in using smart devices, such as smartphones and smart watches, and whether that interest makes it good business for the company Bellabeat to invest in products and services that depend on smart devices.
The case study is organized according to the six steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.
The business task is to analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices, and how that could influence the marketing strategy for Bellabeat’s products (App, Leaf, and Time).
Fitbit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users.
Comments on the dataset: Fitbit Fitness Tracker Data is limited and does not include demographic information, such as the gender and age group of the people who contributed the data.
The data cleaning was done in RStudio. The steps are documented below.
Note: Here, the ‘tidyverse’ library is loaded
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Here, the .csv files are imported and stored in data frames
dailyActivity_merged_df <- read_csv("dailyActivity_merged.csv")
We’ll create another dataframe for the sleep data.
sleep_day_df <- read.csv("sleepDay_merged.csv")
Take a look at the dailyActivity_merged_df data.
head(dailyActivity_merged_df)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
Identify all the columns in the dailyActivity_merged_df data.
colnames(dailyActivity_merged_df)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
Take a look at the sleep_day_df data.
head(sleep_day_df)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
Identify all the columns in the sleep_day_df data.
colnames(sleep_day_df)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
Note that both datasets have the ‘Id’ field - this can be used to merge the datasets.
How many unique participants are there in each dataframe? It looks like there may be more participants in the daily activity dataset than the sleep dataset.
n_distinct(dailyActivity_merged_df$Id)
## [1] 33
n_distinct(sleep_day_df$Id)
## [1] 24
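To see which participants appear in the activity data but not in the sleep data, the two Id columns can be compared directly. The following is a minimal sketch using base R's setdiff() on two small made-up data frames; activity_toy and sleep_toy are hypothetical stand-ins, not the real Fitbit data.

```r
# Hypothetical stand-ins for the two data frames (not the real Fitbit data)
activity_toy <- data.frame(
  Id = c(1, 1, 2, 3, 3),
  TotalSteps = c(100, 200, 300, 400, 500)
)
sleep_toy <- data.frame(
  Id = c(1, 3),
  TotalMinutesAsleep = c(400, 420)
)

# Ids present in the activity data but missing from the sleep data
ids_without_sleep <- setdiff(unique(activity_toy$Id), unique(sleep_toy$Id))
ids_without_sleep
```

Applied to the real data, the same comparison would be setdiff(dailyActivity_merged_df$Id, sleep_day_df$Id).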
How many observations are there in each dataframe?
nrow(dailyActivity_merged_df)
## [1] 940
nrow(sleep_day_df)
## [1] 413
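Row counts alone do not show whether any rows are repeated or incomplete. As a sketch of how that could be checked during cleaning, the block below counts duplicated rows and missing values on a small hypothetical data frame (sleep_toy is made up for illustration; whether the real data contains duplicates or missing values is not established here).

```r
# Hypothetical example data with one duplicated row and one missing value
sleep_toy <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 384, NA)
)

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(sleep_toy))

# Number of missing values in each column
n_missing <- colSums(is.na(sleep_toy))

# Keep only the first occurrence of each row
sleep_toy_clean <- sleep_toy[!duplicated(sleep_toy), ]
```

The same calls could be run on dailyActivity_merged_df and sleep_day_df before any summary statistics are computed.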
What are some quick summary statistics we’d want to know about each data frame?
For the daily activity dataframe:
dailyActivity_merged_df %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :1440.0
For the sleep dataframe:
sleep_day_df %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
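What does this tell us about the activities of this sample of people? Most of the time spent in bed is spent asleep. If sedentary minutes include time in bed, then time in bed accounts for a large share of them, although the mean time in bed is a little under half of the mean sedentary minutes.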
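The comparison can be made concrete with a little arithmetic on the means reported by summary() above. This is only a rough sketch: the two means come from different data frames (940 activity rows versus 413 sleep rows), and it assumes that sedentary minutes include time in bed, which the dataset description here does not confirm.

```r
# Means copied from the summary() output above
mean_minutes_asleep <- 419.5
mean_time_in_bed <- 458.6
mean_sedentary_minutes <- 991.2

# Share of time in bed that is spent asleep
asleep_share <- mean_minutes_asleep / mean_time_in_bed

# Time in bed relative to sedentary time
# (assumption: sedentary minutes include time in bed)
in_bed_share <- mean_time_in_bed / mean_sedentary_minutes

round(c(asleep_share = asleep_share, in_bed_share = in_bed_share), 3)
```

On these means, roughly 91% of time in bed is spent asleep, and time in bed corresponds to roughly 46% of sedentary time.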
For the daily activity dataframe:
ggplot(data=dailyActivity_merged_df, aes(x=TotalSteps, y=SedentaryMinutes)) + geom_smooth() + geom_point(mapping = aes(color = Calories))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What’s the relationship between steps taken in a day and sedentary minutes? How could this help inform the customer segments that we can market to? E.g. position this more as a way to get started in walking more? Or to measure steps that you’re already taking? There seems to be a general tendency that the more steps taken, the fewer the sedentary minutes (negative correlation), and the more steps taken, the more calories burned (positive correlation). This suggests that the device can be positioned to customers as a way to walk more and burn more calories.
For the sleep dataframe:
ggplot(data=sleep_day_df, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
geom_smooth() +
geom_point()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What’s the relationship between minutes asleep and time in bed? You might expect it to be almost completely linear - are there any unexpected trends? There is a strong, nearly linear relationship: the more time spent in bed, the more minutes asleep.
What could these trends tell you about how to help market this product? Or areas where you might want to explore further?
Check the correlation between variables to confirm the relationships suggested above.
For the daily activity dataframe:
dailyActivity_merged_df %>%
summarise(cor(TotalSteps, SedentaryMinutes), cor(TotalSteps, Calories))
## # A tibble: 1 × 2
## `cor(TotalSteps, SedentaryMinutes)` `cor(TotalSteps, Calories)`
## <dbl> <dbl>
## 1 -0.327 0.592
For the sleep dataframe:
sleep_day_df %>%
summarise(cor(TotalMinutesAsleep, TotalTimeInBed))
## cor(TotalMinutesAsleep, TotalTimeInBed)
## 1 0.9304575
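One way to read these coefficients is to square them, which gives the share of variance that a straight-line fit would explain. The sketch below simply reuses the values printed above; the vector names are made up for illustration.

```r
# Correlation coefficients copied from the output above
r <- c(
  steps_vs_sedentary = -0.327,
  steps_vs_calories = 0.592,
  asleep_vs_in_bed = 0.9304575
)

# Squared correlation: share of variance a straight-line fit would explain
r_squared <- r^2
round(r_squared, 3)
```

On this reading, steps account for about 11% of the variation in sedentary minutes and about 35% of the variation in calories, while time in bed accounts for about 87% of the variation in minutes asleep. The first relationship is therefore much weaker than the last.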
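Note: The two data frames are merged on ‘Id’ only, so each sleep record is paired with every activity record of the same participant rather than only with the record for the matching day. You could set “all = TRUE” to keep the Ids that appear in only one of the data frames.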
combined_data <- merge(sleep_day_df, dailyActivity_merged_df, by="Id", all = FALSE)
Take a look at how many participants are in this data set.
n_distinct(combined_data$Id)
## [1] 24
colnames(combined_data)
## [1] "Id" "SleepDay"
## [3] "TotalSleepRecords" "TotalMinutesAsleep"
## [5] "TotalTimeInBed" "ActivityDate"
## [7] "TotalSteps" "TotalDistance"
## [9] "TrackerDistance" "LoggedActivitiesDistance"
## [11] "VeryActiveDistance" "ModeratelyActiveDistance"
## [13] "LightActiveDistance" "SedentaryActiveDistance"
## [15] "VeryActiveMinutes" "FairlyActiveMinutes"
## [17] "LightlyActiveMinutes" "SedentaryMinutes"
## [19] "Calories"
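Because both SleepDay and ActivityDate survive the merge as separate columns, the rows of combined_data are not matched by day. As a hedged alternative, the sketch below parses both date columns into a shared Date column and merges on Id and Date together. The toy data frames and the Date column name are assumptions made for illustration, and the results reported in this document were not recomputed with this approach.

```r
# Hypothetical stand-ins that mimic the date formats shown above
sleep_toy <- data.frame(
  Id = c(1, 1),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384)
)
activity_toy <- data.frame(
  Id = c(1, 1, 1),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  TotalSteps = c(13162, 10735, 10460)
)

# Parse both date columns into a shared Date column; the trailing
# time of day in SleepDay is ignored once the format has been matched
sleep_toy$Date <- as.Date(sleep_toy$SleepDay, format = "%m/%d/%Y")
activity_toy$Date <- as.Date(activity_toy$ActivityDate, format = "%m/%d/%Y")

# Merging on Id alone pairs every sleep row with every activity row of that Id
merged_by_id <- merge(sleep_toy, activity_toy, by = "Id")

# Merging on Id and Date keeps only rows that describe the same day
merged_by_day <- merge(sleep_toy, activity_toy, by = c("Id", "Date"))

nrow(merged_by_id)   # 2 x 3 = 6 rows
nrow(merged_by_day)  # 2 rows
```

The difference in row counts shows why the choice of merge keys matters for any correlation computed on the merged data.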
Do participants who sleep more also take more steps or fewer steps per day?
ggplot(data = combined_data, aes(x = TotalMinutesAsleep, y = TotalSteps)) +
geom_smooth() +
geom_point(mapping = aes(color = TotalDistance))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Check if there is a numeric correlation.
combined_data %>%
summarise(cor(TotalMinutesAsleep, TotalSteps))
## cor(TotalMinutesAsleep, TotalSteps)
## 1 -0.09854146
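The correlation is very weak (about -0.10), so there is little evidence of a linear relationship between time spent asleep and number of steps taken. In this sample, taking more steps is not associated with sleeping noticeably less, which suggests that being more active with the device does not come at the expense of the users’ regular sleep pattern.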
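The question above is asked about participants, while the correlation is computed over individual rows. One possible refinement, sketched here on hypothetical data, is to average each measure per participant first and then correlate the averages. daily_toy and its values are made up, and this is not the method used for the result above.

```r
# Hypothetical daily records for three participants (not the real data)
daily_toy <- data.frame(
  Id = c(1, 1, 2, 2, 3, 3),
  TotalMinutesAsleep = c(400, 420, 300, 320, 480, 500),
  TotalSteps = c(8000, 9000, 12000, 11000, 5000, 6000)
)

# Average each measure per participant
per_participant <- aggregate(
  cbind(TotalMinutesAsleep, TotalSteps) ~ Id,
  data = daily_toy,
  FUN = mean
)

# Correlation across participants rather than across daily rows
participant_cor <- cor(
  per_participant$TotalMinutesAsleep,
  per_participant$TotalSteps
)
participant_cor
```

With only 24 participants in the merged data, a participant-level correlation would rest on few points and should be read with caution.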
The Fitabase Data shows that:
Marketing strategy based on the analysis: