This case study presents my data analysis of “Bellabeat”, based on Capstone Project: Case Study 2 from the Google Data Analytics course. The goal is to gain insights into how consumers use their smart devices; these insights will inform the marketing strategy for Bellabeat products.

The analysis follows these steps: Ask, Prepare, Process, Analyze, and Share.

Ask

Three questions needed to be answered for this case study are:

The key tasks are:

Prepare

The data was sourced from Mobius and is publicly available on Kaggle. It consists of personal fitness tracker data from 30 Fitbit users, who consented to share minute-level output for physical activity, heart rate, and sleep monitoring. The data also includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. The data covers 4/12/2016 to 5/12/2016. The datasets were downloaded as CSV files and imported into RStudio for processing.

The limitations of the data are:

To prepare the datasets, the required tidyverse packages are installed first, and the datasets for daily activities, sleep, and weight information are imported. The hourly datasets for calories, intensities, and steps are also used, so the insights from this analysis focus on daily and hourly trends.

# Installing packages

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggpubr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")  # needed for library("skimr") below
library("tidyverse")
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("dplyr")
library("janitor")
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library("skimr")
library("lubridate")
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library("ggplot2")
library("ggpubr")
library("readr")
setwd("/cloud/project/R_Capstone_Bellabeat/Fitbase_Data")

daily_activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_calories <- read_csv("hourlyCalories_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_intensities <- read_csv("hourlyIntensities_merged.csv")
## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
seconds_heartrate <- read_csv("heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_day <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_info <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Process

Cleaning the data

Each dataset is checked for whether the number of observations matches the sample, whether the data types are correct, and whether null values exist. For the daily activity data, we have

# Checking data for daily activity

head(daily_activity)
skim_without_charts(daily_activity)
glimpse(daily_activity)

For sleep_day data,

# Checking data for daily sleep

head(sleep_day)
skim_without_charts(sleep_day)
glimpse(sleep_day)

Then, for seconds_heartrate data,

# Checking data for heart rate

head(seconds_heartrate)
skim_without_charts(seconds_heartrate)
glimpse(seconds_heartrate)

Checking the hourly data,

# Checking data for hourly calories, intensities, and steps

head(hourly_calories)
skim_without_charts(hourly_calories)
glimpse(hourly_calories)

head(hourly_intensities)
skim_without_charts(hourly_intensities)
glimpse(hourly_intensities)

head(hourly_steps)
skim_without_charts(hourly_steps)
glimpse(hourly_steps)

No null values were found in the datasets, and the data types have been identified. Next, the number of unique users in each dataset is verified.
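A minimal sketch of how the null-value and duplicate-row checks could be made explicit. The data frame `toy_activity` and its values are hypothetical stand-ins, not rows from the Fitbit files.

```r
# Hypothetical stand-in for one of the imported tables (values are made up)
toy_activity <- data.frame(
  Id         = c(1, 1, 2, 2, 2),
  TotalSteps = c(1000, 1000, 5000, NA, 7000),
  Calories   = c(1800, 1800, 2100, 2000, 2300)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_activity))

# Number of rows that repeat an earlier row in every column
duplicate_rows <- sum(duplicated(toy_activity))

# Number of distinct users
unique_users <- length(unique(toy_activity$Id))
```

The same three calls could be pointed at `daily_activity`, `sleep_day`, or any of the hourly tables.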

# Counting unique users in each dataset

n_distinct(daily_activity$Id)
## [1] 33
n_distinct(hourly_calories$Id)
## [1] 33
n_distinct(hourly_intensities$Id)
## [1] 33
n_distinct(hourly_steps$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(seconds_heartrate$Id)
## [1] 14
n_distinct(weight_info$Id)
## [1] 8

The seconds_heartrate and weight_info datasets contain only 14 and 8 unique user ids, respectively. To avoid drawing trends from such small samples, they are excluded from the analysis. Additionally, the dates are stored as strings, so they are converted to the date data type and added to the daily activity data.
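As a small illustration of the string-to-date conversion, the sketch below parses one sample string. It assumes the date columns use the month/day/four-digit-year layout seen in the date range quoted in the Prepare section; `raw_date` is a made-up example value.

```r
# Sample date string, assumed to match the layout of the date columns
raw_date <- "4/12/2016"

# %Y reads a four-digit year; %y would read only two digits ("20")
parsed_date <- as.Date(raw_date, format = "%m/%d/%Y")

# ISO weekday number (1 = Monday ... 7 = Sunday), independent of locale
weekday_number <- format(parsed_date, "%u")

# Day name in the current locale, as used for the weekday summaries
weekday_name <- weekdays(parsed_date)
```

Checking one or two parsed values against a calendar is a cheap way to confirm that the format string matches the data before grouping by weekday.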

# Cleaning data

user_id <- as.character(daily_activity$Id)

# ActivityDate holds four-digit years (e.g. 4/12/2016), so the format uses %Y;
# %y would read only the first two digits of the year
daily_activity_date <- as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")

new_daily_activity <- daily_activity %>% 
  mutate(formatted_date = daily_activity_date, weekday_date = weekdays(daily_activity_date))

# SleepDay splits into more than two pieces; extra = "drop" discards the rest without a warning
new_sleep_day <- separate(sleep_day, SleepDay, into = c("Sleep_Date","Sleep_Time"), sep = " ", extra = "drop") %>%
  mutate(formatted_date = as.Date(Sleep_Date, format = "%m/%d/%Y"))

hourly_calories_hour <- mdy_hms(hourly_calories$ActivityHour)
new_hourly_calories <- hourly_calories %>% 
  mutate(formatted_hour = format(as.POSIXct(hourly_calories_hour),"%H:%M:%S"))

hourly_intensities_hour <- mdy_hms(hourly_intensities$ActivityHour)
new_hourly_intensities <- hourly_intensities %>% 
  mutate(formatted_hour = format(as.POSIXct(hourly_intensities_hour),"%H:%M:%S"))

hourly_steps_hour <- mdy_hms(hourly_steps$ActivityHour)
new_hourly_steps <- hourly_steps %>% 
  mutate(formatted_hour = format(as.POSIXct(hourly_steps_hour),"%H:%M:%S"))
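The three hourly tables go through the same conversion, so the step could be wrapped in one function. This is only a sketch: `add_formatted_hour` and `toy_hourly` are hypothetical names, and the timestamp layout in the toy rows (month/day/year followed by a 12-hour time with AM/PM) is an assumption about the `ActivityHour` column.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: adds an "HH:MM:SS" column parsed from ActivityHour
add_formatted_hour <- function(hourly_data) {
  hourly_data %>%
    mutate(formatted_hour = format(mdy_hms(ActivityHour), "%H:%M:%S"))
}

# Made-up rows in the assumed timestamp layout
toy_hourly <- data.frame(
  Id           = c(1, 1),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 5:00:00 PM"),
  Calories     = c(60, 120)
)

toy_hourly_formatted <- add_formatted_hour(toy_hourly)
```

With such a helper, each of the hourly tables would need a single call, for example `add_formatted_hour(hourly_steps)`.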

Analyze and Share

Summarizing the data

After cleaning the data, we set the focus of the analysis. Here, we want to find trends using user ids, total steps, total calories, and active minutes. The first step is to group the data by user id and compute the mean steps and calories.

# Data summary for average steps taken and calories burned daily per person

daily_activity_per_user <- daily_activity %>% group_by(Id) %>% 
  summarize(mean_steps = mean(TotalSteps), mean_calorie = mean(Calories))

Then, we group the average steps and calories produced by the sample per day in a week.

# Data summary for average steps taken and calories burned per day of the week

daily_activity_per_day <- new_daily_activity %>% group_by(weekday_date) %>% 
  summarize(mean_steps = mean(TotalSteps), mean_calories = mean(Calories))

Next, we determine the mean sleep minutes per day for every user and the mean sleep minutes for each day of the week.

mean_sleep_per_day <- new_sleep_day %>% group_by(Id) %>%
  summarize(mean_sleep = mean(TotalMinutesAsleep), mean_totaltimeinbed = mean(TotalTimeInBed))

mean_sleep_per_week <- new_sleep_day %>%
  mutate(formatted_date = as.Date(Sleep_Date, format = "%m/%d/%Y"),  # %Y: four-digit year
         Weekday_Sleep = weekdays(formatted_date)) %>%
  group_by(Weekday_Sleep) %>%
  summarize(mean_sleep = mean(TotalMinutesAsleep), mean_TotalTimeInBed = mean(TotalTimeInBed), mean_sleep_record = mean(TotalSleepRecords))

After that, summary tables of the mean calories, intensities, and steps for each hour are produced, to be combined for analysis later.

# Data summary for hourly calories, intensities and steps

to_calories_per_hour <- new_hourly_calories %>% group_by(formatted_hour) %>% 
  summarize(mean_calories = mean(Calories))

to_intensities_per_hour <- new_hourly_intensities %>% group_by(formatted_hour) %>% 
  summarize(mean_intensity = mean(TotalIntensity))

to_steps_per_hour <- new_hourly_steps %>% group_by(formatted_hour) %>% 
  summarize(mean_steps = round(mean(StepTotal),0))
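One possible way to combine the three hourly summaries into a single table is to join them on the shared `formatted_hour` column. The sketch below uses made-up values, and the `toy_*` names are hypothetical.

```r
library(dplyr)

# Made-up hourly summaries standing in for the three tables above
toy_calories    <- data.frame(formatted_hour = c("17:00:00", "18:00:00"),
                              mean_calories  = c(120, 123))
toy_intensities <- data.frame(formatted_hour = c("17:00:00", "18:00:00"),
                              mean_intensity = c(21, 22))
toy_steps       <- data.frame(formatted_hour = c("17:00:00", "18:00:00"),
                              mean_steps     = c(550, 600))

# Join on the hour so each row holds all three hourly means
toy_hourly_summary <- toy_calories %>%
  inner_join(toy_intensities, by = "formatted_hour") %>%
  inner_join(toy_steps, by = "formatted_hour")
```

An inner join keeps only hours present in all three tables, which seems a reasonable choice here since each table is grouped by the same hour labels.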

The first trend to examine is the relationship between calories burned and daily steps taken.

[Figures: scatter plots with smoothing lines fitted by geom_smooth()]

Based on the figure, the smoothing line shows a moderate correlation between calories and total steps taken. Users with long active minutes distinctly burn high amounts of calories, and the same holds for users with short sedentary minutes. Next, we’ll check on which days of the week users take the most steps.
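The strength of the relationship between steps and calories can also be put into a number with a correlation coefficient. The sketch below uses made-up values in `toy_daily`; the linear smoother (`method = "lm"`) is a choice made for this illustration and is not necessarily the smoother behind the figure above.

```r
library(ggplot2)

# Made-up daily records (not values from the Fitbit files)
toy_daily <- data.frame(
  TotalSteps = c(2000, 4000, 6000, 8000, 10000, 12000),
  Calories   = c(1700, 2100, 1900, 2500, 2300, 2900)
)

# Pearson correlation coefficient between steps and calories
steps_calories_cor <- cor(toy_daily$TotalSteps, toy_daily$Calories)

# Scatter plot with a fitted straight line
steps_calories_plot <- ggplot(toy_daily, aes(x = TotalSteps, y = Calories)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Total steps", y = "Calories burned")
```

Applied to `daily_activity`, the same `cor()` call would give a single figure to report alongside the plot.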

The highest average number of steps was taken on Thursday, but notably high step counts were also recorded on the weekend days, Saturday and Sunday.

Figuring out the trend of calories burned per day in a week

The third figure shows the trend for burned calories per day. Thursday and Sunday are the top days for burned calories. Note that Thursday is also the top day for average steps taken. This is consistent with the correlation between burned calories and steps taken.
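When the per-weekday summaries are plotted or printed, day names stored as plain text sort alphabetically. A sketch of putting them in calendar order with a factor follows; the values in `toy_per_day` are made up, and the English day names assume an English locale for `weekdays()`.

```r
library(dplyr)

# Made-up per-weekday summary, in the alphabetical order group_by() would give
toy_per_day <- data.frame(
  weekday_date = c("Friday", "Monday", "Saturday", "Sunday",
                   "Thursday", "Tuesday", "Wednesday"),
  mean_steps   = c(7400, 7800, 8100, 6900, 7400, 8100, 7500)
)

# Calendar order for the factor levels
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# A factor with explicit levels sorts (and plots) in calendar order
ordered_per_day <- toy_per_day %>%
  mutate(weekday_date = factor(weekday_date, levels = week_order)) %>%
  arrange(weekday_date)
```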

The next figures show the hourly trends for calories, intensities, and steps, followed by the trends for sleep.

[Figure: bar chart of mean calories burned per hour of the day]

Accordingly, most calories are burned from 5 to 6 in the afternoon. For the trend in intensity,

[Figure: bar chart of mean intensity per hour of the day]

Similar to the previous figure, intensity is greatest from 5 to 6 in the afternoon. For the steps taken in a day,

[Figure: bar chart of mean steps per hour of the day]

This figure resembles the previous ones: steps taken are greatest from 5 to 6 in the afternoon.
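For bar charts of values that are already summarised per hour, `geom_col()` is the usual ggplot2 choice, since it draws each bar at the height given in the data. The sketch below uses made-up numbers in `toy_steps_per_hour`; the fill colour and axis labels are illustrative choices.

```r
library(ggplot2)

# Made-up hourly means (not values from the Fitbit files)
toy_steps_per_hour <- data.frame(
  formatted_hour = c("16:00:00", "17:00:00", "18:00:00", "19:00:00"),
  mean_steps     = c(500, 550, 600, 580)
)

# One bar per hour, with height equal to the pre-computed mean
hourly_steps_plot <- ggplot(toy_steps_per_hour,
                            aes(x = formatted_hour, y = mean_steps)) +
  geom_col(fill = "cyan4") +
  labs(x = "Hour of day", y = "Mean steps") +
  theme(axis.text.x = element_text(angle = 90))
```

The same pattern would apply to the hourly calories and intensity summaries by swapping the data frame and the `y` column.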

The next part shows the frequency distribution of mean minutes of sleep per day.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.0   336.3   419.1   377.6   449.3   652.0

The distribution of mean minutes of sleep per day is skewed to the left, with a mean of 377.6 minutes. The most frequent values range from approximately 340 to 450 minutes. Outliers are present in the distribution. For the trend of minutes of sleep per week,

The bar plot shows sleep minutes increasing from Tuesday and peaking on Friday. They then decrease until Sunday before peaking again on Monday. Overall, users tend to sleep for roughly 7 hours or more per day.
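A short sketch of how the sleep figures can be summarised and converted to hours. The vector `toy_mean_sleep` holds made-up per-user means in minutes, and comparing the mean with the median is only a rough indicator of skew, not a formal test.

```r
# Made-up per-user mean sleep, in minutes
toy_mean_sleep <- c(61, 300, 340, 400, 420, 430, 450, 652)

# Five-number summary plus the mean
sleep_summary <- summary(toy_mean_sleep)

# A mean below the median hints at a longer tail on the left
is_left_skewed <- mean(toy_mean_sleep) < median(toy_mean_sleep)

# Minutes converted to hours, for comparison with a sleep recommendation
mean_sleep_hours <- mean(toy_mean_sleep) / 60
```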

Findings

  • Calories burned are moderately correlated with the number of steps taken by users.
  • High step counts and calories burned occur mostly on weekends, but are also notably high on Thursday.
  • Within a day, average calories, intense activity, and steps are highest from 5 to 6 in the afternoon.
  • Overall, users tend to sleep for roughly 7 hours or more per day, with the longest sleep on Friday and Monday.

Insights

  • Increased activity leads to more calories burned, which is important for users to note.
  • Users are most probably working professionals, since they take more steps and do more intense activities on weekends. Among weekdays, Thursday appears to be when they have the most free time to be active.
  • Intense activities happen mostly in the late afternoon, presumably because users are working during the day. This is supported by the increase in calories burned and steps taken during that time, and by the amount of sleep.
  • Most users tend to sleep the longest at the beginning and end of the work week. However, the average sleep time of most users is less than the recommended amount, which is 8 hours.

Recommendations