1. About Bellabeat

Bellabeat is a high-tech company that manufactures health-focused smart products such as Leaf, Time, and Spring. These smart devices collect users’ health data such as steps, heart rate, sleep quality, and calories burnt. This project aims to identify trends in how users use their devices and to turn those trends into marketing strategies.

The Business Task

*Identify trends in smart device usage

*Apply these usage trends to improve the Bellabeat customer experience

*Create a Bellabeat marketing strategy

2. Prepare

2.1 Data

*Data source: the datasets come from the FitBit Fitness Tracker Data, which is openly accessible on Kaggle. Link to data set: FitBit Fitness Tracker Data

2.2 Packages used

*Loading the packages that will be used for data cleaning and analysis.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)

2.3 Dataset imported

*Importing the datasets used. In this project, we will use users’ daily steps, sleep records, daily activity, and calories, together with hourly steps, intensities, and calories, for analysis.

steps_records <- read_csv("dailySteps_merged.csv")

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sleep_records <- read_csv("sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Activity_all <- read_csv("dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Calories_all <- read_csv("dailyCalories_merged.csv")

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hourlySteps <- read_csv("hourlySteps_merged.csv")

## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hourlyIntensities <- read_csv("hourlyIntensities_merged.csv")

## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hourlyCalories <- read_csv("hourlyCalories_merged.csv")

## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.4 Preview

*Functions used to preview the data sets.

glimpse(steps_records)

## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ StepTotal   <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054…

head(steps_records)

## # A tibble: 6 × 3
##           Id ActivityDay StepTotal
##        <dbl> <chr>           <dbl>
## 1 1503960366 4/12/2016       13162
## 2 1503960366 4/13/2016       10735
## 3 1503960366 4/14/2016       10460
## 4 1503960366 4/15/2016        9762
## 5 1503960366 4/16/2016       12669
## 6 1503960366 4/17/2016        9705

The glimpse function lets me preview which attributes are in the data sets. It showed that the date columns were all stored as character (<chr>) instead of date/date-time. Also, a number of cells contain null values. The head function lets me see the values I am working with. The dates are from 2016, so the data might be outdated. Some values also seemed unlikely and needed to be cleaned.
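A minimal sketch of the date fix on a few rows copied from the preview above. The names `steps_demo` and `na_counts` are illustrative, and the month/day/year layout is assumed from the printed values:

```r
# Three rows in the same shape as dailySteps_merged.csv
steps_demo <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/14/2016"),
  StepTotal = c(13162, 10735, 10460),
  stringsAsFactors = FALSE
)

# Convert the character column to Date using the month/day/year layout
steps_demo$ActivityDay <- as.Date(steps_demo$ActivityDay, format = "%m/%d/%Y")

# Count missing values per column; dates that fail to parse show up here as NA
na_counts <- colSums(is.na(steps_demo))

str(steps_demo)
print(na_counts)
```

Checking the NA counts straight after the conversion is a cheap way to catch rows whose date did not match the assumed format.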

3. Process

3.1 Data cleaning

*Checking for duplicates

sum(duplicated(steps_records))

## [1] 0

sum(duplicated(sleep_records))

## [1] 3

sum(duplicated(Activity_all))

## [1] 0

sum(duplicated(Calories_all))

## [1] 0

sum(duplicated(hourlySteps))

## [1] 0

sum(duplicated(hourlyIntensities))

## [1] 0

sum(duplicated(hourlyCalories))

## [1] 0

*There are duplicates in sleep_records.

Removing duplicates

sleep_records <- sleep_records %>%
  distinct() %>%
  drop_na()
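The same clean-up can be sketched in base R on a small made-up table (`sleep_demo` and its values are hypothetical). Here `unique()` plays the role of `distinct()` and `na.omit()` the role of `drop_na()`:

```r
# Made-up sleep records: row 2 duplicates row 1, row 4 has a missing value
sleep_demo <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(300, 300, 360, NA),
  stringsAsFactors = FALSE
)

n_duplicates <- sum(duplicated(sleep_demo))       # fully identical rows
n_incomplete <- sum(!complete.cases(sleep_demo))  # rows with at least one NA

# Drop duplicate rows first, then rows that contain missing values
sleep_clean <- na.omit(unique(sleep_demo))

cat("rows before:", nrow(sleep_demo), "rows after:", nrow(sleep_clean), "\n")
```

Comparing the row counts before and after confirms that only the expected rows were removed.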

3.2 Formatting and merging data

*Merging records so that later analysis is easier.

I chose not to merge all data into one file; keeping a separate data set for each analysis makes things less messy.

sleep_records <- sleep_records %>% 
  separate(SleepDay, into=c("ActivityDay", "times"), sep=' ')

## Warning: Expected 2 pieces. Additional pieces discarded in 410 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

step_sleep_all <- merge(sleep_records, steps_records, by = c('ActivityDay','Id'), all=TRUE)
step_sleep_all$ActivityDay <- as.Date(step_sleep_all$ActivityDay, format="%m/%d/%Y")

step_calories <- merge(Calories_all, steps_records, by = c('ActivityDay','Id'), all=TRUE)

# Parse the timestamp, then derive day and hour with explicit formats so that
# midnight is kept as "00:00:00" instead of being turned into NA
hourlyIntensities <- hourlyIntensities %>% 
  rename(date_time = ActivityHour) %>% 
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()),
         ActivityDay = format(date_time, "%Y-%m-%d"),
         times = format(date_time, "%H:%M:%S")) %>% 
  select(-date_time)

hourlyCalories <- hourlyCalories %>% 
  rename(date_time = ActivityHour) %>% 
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()),
         ActivityDay = format(date_time, "%Y-%m-%d"),
         times = format(date_time, "%H:%M:%S")) %>% 
  select(-date_time)

# Merge on the day as well as the hour so each hourly record is matched exactly once
Calories_Intensities <- merge(hourlyIntensities, hourlyCalories, by = c('Id', 'ActivityDay', 'times'), all = TRUE)
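When hourly tables span several days, the choice of merge keys determines how many rows come out of the merge. A minimal sketch on made-up data (`intens_demo`, `cal_demo`, and their values are hypothetical) compares merging on user and hour only with merging on user, day, and hour:

```r
# One user, the same hour on two different days
intens_demo <- data.frame(
  Id = 1,
  ActivityDay = c("2016-04-12", "2016-04-13"),
  times = "18:00:00",
  TotalIntensity = c(20, 30)
)
cal_demo <- data.frame(
  Id = 1,
  ActivityDay = c("2016-04-12", "2016-04-13"),
  times = "18:00:00",
  Calories = c(100, 120)
)

# Without the day in the key, every day is paired with every other day
by_hour_only <- merge(intens_demo, cal_demo, by = c("Id", "times"), all = TRUE)

# With the day in the key, each hourly record is matched exactly once
by_day_and_hour <- merge(intens_demo, cal_demo,
                         by = c("Id", "ActivityDay", "times"), all = TRUE)

nrow(by_hour_only)     # 4 rows: 2 days x 2 days
nrow(by_day_and_hour)  # 2 rows: one per day
```

A quick `nrow()` comparison against the input tables is an easy guard against accidental many-to-many merges.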

4. Data visualization and analysis

4.1 Device usage while sleeping

I noticed that there are numerous null or zero values in the data set where sleep is documented. I want to find out whether those users are not using their devices at all, even during the day, or whether they just don’t have their sleep tracked.

*To identify each user more easily, I assigned each user a unique number.

step_sleep_protect <- transform(step_sleep_all,
                      user = as.numeric(factor(Id)))

*I use visualization to see how much data is collected when users go to bed.

ggplot(step_sleep_protect) +
  geom_point(aes(x=ActivityDay, y=TotalTimeInBed, colour=user)) + 
  facet_wrap(~user) +
  labs(x="Date", y="Minutes", title = "Time in bed", caption="Figure 1.")+
  theme(axis.text.x = element_text(angle = 90))

## Warning: Removed 530 rows containing missing values (`geom_point()`).

*Then, I check whether those users are using their devices during the day.

ggplot(step_sleep_protect) +
  geom_point(aes(x=ActivityDay, y=StepTotal, colour=user)) + 
  facet_wrap(~user) +
  labs(x="Date", y="Steps in one day", title = "Steps per day for each user", caption="Figure 2.")+
  theme(axis.text.x = element_text(angle = 90))

*Comparing both charts, one can see that several users use their devices to track their daily steps but not when they go to bed.
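The comparison between the two charts can also be made numerically by counting, per user, the days with a step record and the days with a sleep record. A minimal sketch on made-up data (`merged_demo` and its values are hypothetical; the column names follow the merged table used for the figures):

```r
# Made-up merged table: user 2 has step records but no sleep records
merged_demo <- data.frame(
  user = c(1, 1, 1, 2, 2, 2),
  StepTotal = c(9000, 11000, 8000, 4000, 5000, 6000),
  TotalTimeInBed = c(450, 430, NA, NA, NA, NA)
)

# Flag which rows carry a step record and which carry a sleep record
merged_demo$has_steps <- !is.na(merged_demo$StepTotal)
merged_demo$has_sleep <- !is.na(merged_demo$TotalTimeInBed)

# Count the flagged days per user
usage <- aggregate(merged_demo[c("has_steps", "has_sleep")],
                   by = list(user = merged_demo$user), FUN = sum)

# Users who track steps but never track sleep
steps_only <- usage$user[usage$has_steps > 0 & usage$has_sleep == 0]

print(usage)
print(steps_only)
```

A table like `usage` gives an exact count of the users who only track steps, which is hard to read off a faceted scatter plot.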

4.1.2 Sleeping habits

*In the data set “sleepDay_merged”, there are two columns, “TotalMinutesAsleep” and “TotalTimeInBed”, which contain different values. I am curious how long it takes users to fall asleep after they go to bed.

ggplot(step_sleep_protect)+
  geom_smooth(aes(x=ActivityDay, y=TotalTimeInBed, col = 'red'))+
  geom_smooth(aes(x=ActivityDay, y=TotalMinutesAsleep, col='blue'))+
  labs(x="Day", y="Minutes", title = "Minutes spent sleeping and in bed", caption="Figure 3.")+
  scale_color_identity( name = "Legend",
                        breaks = c ("red", "blue"),
                        labels = c("Time in bed", "Sleeping"),
                        guide = "legend")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).

*Figure 3 shows a gap of approximately 20 minutes between time in bed and time asleep each day. Note that this gap includes any time spent awake in bed, not only the time taken to fall asleep.
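The gap can also be computed directly rather than read off the smoothed lines. A minimal sketch on made-up values (`gap_demo` and its numbers are hypothetical; the column names are those of sleepDay_merged):

```r
# Made-up sleep records
gap_demo <- data.frame(
  TotalMinutesAsleep = c(400, 420, 380),
  TotalTimeInBed = c(425, 440, 395)
)

# Minutes spent in bed but not asleep, per record
gap_demo$MinutesAwakeInBed <- gap_demo$TotalTimeInBed - gap_demo$TotalMinutesAsleep

# Average gap, ignoring records without sleep data
mean_gap <- mean(gap_demo$MinutesAwakeInBed, na.rm = TRUE)
print(mean_gap)  # 20 for these made-up values
```

On the real data, the same two lines would give a single number to compare against the estimate taken from Figure 3.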

4.2 Analysing user steps data

*Average Steps per day of 33 users

step_sleep_protect %>% 
  group_by(ActivityDay) %>% 
  summarise(Steps=mean(StepTotal)) %>% 
  ggplot()+geom_smooth(aes(x=ActivityDay, y=Steps)) +
  labs(x="Date", y="Steps", title = "Average step of users per day", caption="Figure 4.")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

4.4 Total Steps per day VS Calories burnt

*Logically, the more steps one takes per day, the more calories one should burn. Let’s see if we can find this relationship in the data set.

Activity_all %>% 
  summarise(correlation = cor(TotalSteps, Calories))

## # A tibble: 1 × 1
##   correlation
##         <dbl>
## 1       0.592

*The correlation coefficient between steps per day and calories is about 0.59, which indicates a moderate positive relationship. This value is a point estimate; no confidence interval or significance test was computed. The relationship is visualized in Figure 5.
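To attach an actual confidence interval to a correlation, one option is `cor.test()` from base R’s stats package. A minimal sketch on made-up numbers (`steps_demo_x` and `calories_demo_y` are hypothetical vectors, not the study data):

```r
# Made-up daily steps and calories for eight days
steps_demo_x <- c(2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000)
calories_demo_y <- c(1700, 1900, 1850, 2200, 2300, 2250, 2700, 2900)

# Pearson correlation with a 95% confidence interval and a significance test
result <- cor.test(steps_demo_x, calories_demo_y,
                   method = "pearson", conf.level = 0.95)

result$estimate  # the correlation coefficient
result$conf.int  # lower and upper bound of the 95% interval
result$p.value   # p-value for the null hypothesis of zero correlation
```

Here `conf.level` is an argument of `cor.test()`, so it actually controls the width of the reported interval.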

ggplot(Activity_all)+geom_smooth(aes(x=TotalSteps, y= Calories))+
  labs(x="Total Steps", y="Calories burnt", title = "Total Steps per day VS Calories burnt", caption="Figure 5.")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

4.5 Steps during the day

# Parse the timestamp, then derive day and hour with explicit formats so that
# midnight is kept as "00:00:00" instead of being turned into NA
hourlySteps <- hourlySteps %>% 
  rename(date_time = ActivityHour) %>% 
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()),
         ActivityDay = format(date_time, "%Y-%m-%d"),
         times = format(date_time, "%H:%M:%S")) %>% 
  select(-date_time)
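Splitting a printed date-time on a space relies on how R happens to print it, and a value at exactly midnight may be printed without a time part. A minimal sketch of extracting day and hour with explicit format strings instead (`raw_hours` is a hypothetical vector in the layout of the `ActivityHour` column; the time zone is fixed to UTC here only to keep the example reproducible):

```r
# Timestamps in the same layout as the ActivityHour column
raw_hours <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/2016 6:00:00 PM")

# Parse 12-hour clock timestamps
parsed <- as.POSIXct(raw_hours, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Explicit formats always produce both parts, including for midnight
day_part <- format(parsed, "%Y-%m-%d")
hour_part <- format(parsed, "%H:%M:%S")

print(data.frame(raw_hours, day_part, hour_part))
```

With this approach the midnight record keeps the hour "00:00:00" and stays in the hourly averages.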

hourlySteps %>% 
  group_by(times) %>%
  summarise(stepmean=mean(StepTotal)) %>% 
  ggplot()+geom_col(aes(x=times, y=stepmean, fill=stepmean)) +
  scale_fill_gradient(low = "blue", high = "orange")+
  theme(axis.text.x = element_text(angle = 90))+
  labs(x="Time of day", y="Average steps", title = "Average steps per time of day", caption="Figure 6.")

*From Figure 6, we can see that users take the most steps on average at 18:00.
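The peak hour can also be extracted from the hourly averages instead of being read off the chart. A minimal sketch on made-up data (`hourly_demo` and its values are hypothetical; the column names follow the hourly steps table):

```r
# Made-up hourly step counts for three hours of the day
hourly_demo <- data.frame(
  times = c("12:00:00", "12:00:00", "18:00:00", "18:00:00", "03:00:00", "03:00:00"),
  StepTotal = c(500, 600, 700, 900, 0, 10)
)

# Average steps for each hour of the day
hour_means <- aggregate(StepTotal ~ times, data = hourly_demo, FUN = mean)

# Row with the highest average
peak <- hour_means[which.max(hour_means$StepTotal), ]
print(peak)  # 18:00:00 with an average of 800 for these made-up values
```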

4.6 Steps people should take per day to be healthy

*A widely cited target, often attributed to the Centers for Disease Control and Prevention (CDC), is 10000 steps per day to maintain good general health. Taking fewer than 5000 steps per day is commonly considered a sedentary lifestyle.

step_user <- step_sleep_protect %>% 
  group_by(user) %>% 
  summarise(meanstep=mean(StepTotal))

step_type <- step_user %>%
  mutate(step_type = case_when(
    meanstep < 5000 ~ "Sedentary",
    meanstep >= 5000 & meanstep < 10000 ~ "Somewhat active", 
    meanstep >= 10000 ~ "Healthy amount of steps"
  ))

step_type_percent <- step_type %>%
  group_by(step_type) %>%
  summarise(each_type = n()) %>%
  mutate(all = sum(each_type)) %>%
  group_by(step_type) %>% 
  summarise(step_percent = each_type / all )%>%
  mutate(step_percent_graph = scales::percent(step_percent))

step_type_percent %>% 
  ggplot(aes(x="", y=step_percent, fill=step_type)) + geom_bar(stat = "identity")+
  coord_polar("y", start=0)+
  theme_minimal()+
  geom_text(aes(label = step_percent_graph),
            position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("green3","red3", "lightblue2"))+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank())+
  labs(fill = "User categories", title="Categorizing users based on average steps per day", caption="Figure 7.")

*Only 21.2 percent of users take a healthy number of steps per day, while 54.5 percent take between 5000 and 10000 steps per day, and 24.2 percent take fewer than 5000 steps per day.
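The categorization and the percentage shares can be sketched compactly with `cut()` and `prop.table()` from base R. The `meanstep_demo` values below are made up, and the thresholds are the ones used in this section:

```r
# Made-up average daily steps for eight users
meanstep_demo <- c(3500, 4800, 6000, 7500, 9000, 9999, 10000, 12500)

# Bin users into the three categories; right = FALSE makes the intervals
# [lower, upper), so exactly 10000 steps counts as a healthy amount
step_type_demo <- cut(
  meanstep_demo,
  breaks = c(-Inf, 5000, 10000, Inf),
  labels = c("Sedentary", "Somewhat active", "Healthy amount of steps"),
  right = FALSE
)

# Share of users in each category
shares <- prop.table(table(step_type_demo))
print(round(100 * shares, 1))  # 25, 50 and 25 percent for these made-up values
```

As a plausibility check on the figures above, 21.2, 54.5, and 24.2 percent are consistent with 7, 18, and 8 of the 33 users.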

4.7 Calories and intensities

Calories_Intensities %>% 
  group_by(times) %>% 
  summarise(ti=mean(TotalIntensity), c=mean(Calories)) %>% 
  ggplot()+
  geom_bar(aes(x=times, y=c), stat = "identity", fill = 'green4', alpha = 0.6)+
  geom_segment(aes(x = times, xend= times, y=0, yend = ti/0.2), stat = "identity", color = 'red',position = position_dodge(width = 0.5))+
  geom_point(aes(x=times, y=ti/0.2), color='red')+
  theme(axis.text.x = element_text(angle = 90))+
  scale_y_continuous(name = "Calories", 
                     sec.axis = sec_axis(~.*0.2, name = "Intensity"))+
  theme(axis.title.y.right = element_text(color = "red"))+
  labs(x="Time", title= "Average Intensity and Calories at different times of day", caption="Figure 8.")

Figure 8 shows that calories burnt follow intensity. However, at lower intensities, calories burnt seem to decline drastically. The highest average intensity is at 18:00 and the lowest is at 03:00. The peak coincides with the time when users typically get off work, and the minimum with the time when users are asleep. Furthermore, users are active on average between 07:00 and 21:00, with a drop at 15:00, which may reflect users resting after lunch.

5. Summary/Insights

From Figures 1 and 2, some users use their devices to track daily steps and intensities but not to track their sleep. One possible reason is that those users do not own a device that can be worn while sleeping. Digital marketing could encourage those users to purchase other Bellabeat products that are suitable for wearing in bed. Another possible reason is that some users find the devices uncomfortable to wear during sleep. Also, the sleep data collected covers few aspects of sleep. Bellabeat could design a product specifically for monitoring sleep patterns, for example by detecting rapid eye movement (REM) sleep.
From Figure 3, there is a gap between the time users spend in bed and the time they are actually asleep. This may be because users use their phones in bed. When the device detects that a user is in bed, the Bellabeat app could send a notification reminding the user to reduce screen time for better sleep quality. This could increase interaction between the app and users and improve the user experience.
From Figure 4, the average number of steps per user is declining over the period. This suggests that some users are using the products less often. To increase usage, Bellabeat could create a scoring system in which users earn points when they record their data and can redeem those points for discounts on Bellabeat products.
From Figure 7, about 79 percent of users (54.5 percent plus 24.2 percent) do not reach the recommended 10000 steps per day. Using the average steps per hour, the Bellabeat app could send notifications that encourage users to take more steps and reach the suggested 10000 steps per day.

Bellabeat user data analysis project

2023-09-01