This case study presents my data analysis of “Bellabeat”, based on Capstone Project: Case Study 2 from the Google Data Analytics Course. The goal is to gain insight into how consumers use their smart devices. Those insights will then be used to shape the marketing strategy for Bellabeat products.
The analysis follows these steps: Ask, Prepare, Process, Analyze, and Share.
Three questions need to be answered for this case study:
The key tasks are:
The data was sourced from Mobius and is publicly available on Kaggle. It consists of personal fitness tracker data from 30 Fitbit users, who consented to share minute-level output for physical activity, heart rate, and sleep monitoring. The data also includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. The data covers 4/12/2016 through 5/12/2016. The datasets were downloaded as CSV files and imported into RStudio for processing.
The limitations of the data are:
To prepare the data, the required packages are installed first, and then the datasets for daily activity, sleep, and weight information are imported. The hourly datasets for calories, intensities, and steps are also used, so the insights from this analysis focus on daily and hourly trends.
# Installing packages
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("tidyverse")
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("dplyr")
library("janitor")
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("skimr")
library("lubridate")
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library("ggplot2")
library("readr")
setwd("/cloud/project/R_Capstone_Bellabeat/Fitbase_Data")
daily_activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_calories <- read_csv("hourlyCalories_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_intensities <- read_csv("hourlyIntensities_merged.csv")
## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
seconds_heartrate <- read_csv("heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_day <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_info <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
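The column-specification messages above come from readr guessing each column's type. As a minimal sketch, assuming a small in-memory sample in place of the real file, the types can instead be declared up front, which also quiets the message. The sample values and the name `sample_activity` are made up for illustration, and reading `Id` as character is a choice made here rather than what the import above does.

```r
library(readr)

# Tiny in-memory stand-in for a daily activity file (all values are made up).
sample_csv <- I("Id,ActivityDate,TotalSteps\n1001,4/12/2016,5000\n1001,4/13/2016,7500\n")

# Declare the column types instead of letting readr guess them.
# Id is read as character because it is a label, not a quantity.
sample_activity <- read_csv(
  sample_csv,
  col_types = cols(
    Id = col_character(),
    ActivityDate = col_character(),
    TotalSteps = col_integer()
  )
)

str(sample_activity)
```

With `col_types` supplied, readr has nothing to guess, so the specification message is not printed.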
Each dataset is checked to confirm that the number of observations matches the sample, that the data types are correct, and whether any null values exist. For the daily activity data,
# Checking data for daily activity
head(daily_activity)
skim_without_charts(daily_activity)
glimpse(daily_activity)
For sleep_day data,
# Checking data for daily sleep
head(sleep_day)
skim_without_charts(sleep_day)
glimpse(sleep_day)
Then, for seconds_heartrate data,
# Checking data for heart rate
head(seconds_heartrate)
skim_without_charts(seconds_heartrate)
glimpse(seconds_heartrate)
Checking the hourly data,
# Checking data for hourly calories, intensities, and steps
head(hourly_calories)
skim_without_charts(hourly_calories)
glimpse(hourly_calories)
head(hourly_intensities)
skim_without_charts(hourly_intensities)
glimpse(hourly_intensities)
head(hourly_steps)
skim_without_charts(hourly_steps)
glimpse(hourly_steps)
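The checks above read missing-value counts off the skim output. A more explicit way to get the same information, plus a count of fully repeated rows, might look like the following sketch. It uses base R only, and the data frame `toy_sleep` and its values are invented for illustration, not taken from the Fitbit files.

```r
# Toy data frame standing in for one of the imported tables (values are made up).
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(420, 420, NA, 385)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy_sleep))

# Number of rows that exactly repeat an earlier row.
duplicate_rows <- sum(duplicated(toy_sleep))

print(na_counts)
print(duplicate_rows)
```

Running the same two calls on each imported table would give a per-column missing-value count and a duplicate-row count that can be quoted directly.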
No null values are found in these datasets, and the data types have been identified. Next, the number of unique users in each dataset is verified.
# Counting unique user ids in each dataset
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(hourly_calories$Id)
## [1] 33
n_distinct(hourly_intensities$Id)
## [1] 33
n_distinct(hourly_steps$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(seconds_heartrate$Id)
## [1] 14
n_distinct(weight_info$Id)
## [1] 8
The seconds_heartrate and weight_info datasets contain only 14 and 8 unique user ids, respectively, so they are excluded from the analysis to avoid drawing trends from too few users. The daily and hourly datasets each contain 33 unique user ids, slightly more than the 30 users described for the source, and sleep_day contains 24. Additionally, the dates are stored as strings, so they are converted to the date data type and added to the daily activity data.
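When converting these strings, the format code for the year matters. The dates quoted for this dataset carry four-digit years (for example 4/12/2016), which calls for `%Y`; `%y` reads only two digits. The sketch below compares the two on a single string, using base R only. The variable names are made up for illustration.

```r
raw_date <- "4/12/2016"  # same shape as the dates quoted for this dataset

# %Y consumes the full four-digit year.
parsed_full <- as.Date(raw_date, format = "%m/%d/%Y")

# %y consumes only the first two year digits ("20"); the remaining text is ignored.
parsed_short <- as.Date(raw_date, format = "%m/%d/%y")

print(parsed_full)
print(parsed_short)

# FALSE here means the two format codes disagree on the year.
print(format(parsed_full, "%Y") == format(parsed_short, "%Y"))
```

Because the mismatch does not raise an error, a quick check like `range()` on the parsed column is a cheap way to confirm the dates fall inside the expected period.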
# Cleaning data
user_id <- as.character(daily_activity$Id)
# %Y parses the four-digit year; %y would read only the first two digits
daily_activity_date <- as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")
new_daily_activity <- daily_activity %>%
  mutate(formatted_date = daily_activity_date, weekday_date = weekdays(daily_activity_date))
new_sleep_day <- separate(sleep_day, SleepDay, into = c("Sleep_Date","Sleep_Time"), sep = " ") %>%
  mutate(formatted_date = as.Date(Sleep_Date, format = "%m/%d/%Y"))
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
hourly_calories_hour <- mdy_hms(hourly_calories$ActivityHour)
new_hourly_calories <- hourly_calories %>%
  mutate(formatted_hour = format(as.POSIXct(hourly_calories_hour),"%H:%M:%S"))
hourly_intensities_hour <- mdy_hms(hourly_intensities$ActivityHour)
new_hourly_intensities <- hourly_intensities %>%
  mutate(formatted_hour = format(as.POSIXct(hourly_intensities_hour),"%H:%M:%S"))
hourly_steps_hour <- mdy_hms(hourly_steps$ActivityHour)
new_hourly_steps <- hourly_steps %>%
  mutate(formatted_hour = format(as.POSIXct(hourly_steps_hour),"%H:%M:%S"))
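Once the hour has been extracted, an hourly trend can be summarised by grouping on it. The following is a minimal sketch of that idea on a toy table, not the analysis itself. The table `toy_steps`, its values, and the assumed timestamp shape (month/day/year followed by a 12-hour time with AM/PM) are illustrative assumptions.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for an hourly steps table (all values are made up).
toy_steps <- tibble(
  Id = c(1, 1, 2, 2),
  ActivityHour = c("4/12/2016 1:00:00 PM", "4/12/2016 2:00:00 PM",
                   "4/12/2016 1:00:00 PM", "4/12/2016 2:00:00 PM"),
  StepTotal = c(100, 300, 200, 500)
)

hourly_summary <- toy_steps %>%
  mutate(
    activity_time = mdy_hms(ActivityHour),  # parse the string into a date-time
    hour_of_day = hour(activity_time)       # 0-23 hour of the day
  ) %>%
  group_by(hour_of_day) %>%
  summarise(mean_steps = mean(StepTotal), .groups = "drop")

print(hourly_summary)
```

Grouping on a numeric hour rather than a formatted string keeps the hours in natural order when the result is plotted.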