This case study is part of the Google Data Analytics certification capstone project. In this case study we are working with a fictional fitness company called Bellabeat. Bellabeat offers a variety of health-focused products, including a fitness smartphone application, a fitness watch, a tracker, and a smart water bottle.
You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.
Focus on one of Bellabeat’s products and gain insight into how consumers are using their smart devices. In other words, find patterns in consumer behavior when using smart devices to monitor their health.
R was used for the analysis in this case study. The other options were MySQL, Excel, and Tableau. I chose R because it can handle every aspect of this case study: it has strong data cleaning, transformation, and visualization capabilities.
It is an all-inclusive tool, whereas the other tools mentioned would have had to be used in combination.
The FitBit fitness tracker data will be used in this case study. The data is made available by Mobius through the Kaggle website and can be found here. It contains a total of 18 CSV files.
The metadata for this data can be found in the Fitabase Data Dictionary.
The data for this case study covers 33 unique participants and spans one month, from 12th April 2016 to 12th May 2016. This is limiting because we cannot determine any seasonal trends: the data covers only April to May, so we cannot tell whether the trends found here will hold in other seasons. There is also no information about the location of the participants, so we do not know what kind of temperatures we are dealing with, and that information is vital when analyzing fitness. Additionally, there is no information about the gender and age of the participants. This information is important because fitness data varies greatly by gender and age.
I will use the standard data analysis process, which includes the following steps: ask, prepare, process, analyze, share, and act.
We have been asked to analyze how users use their fitness devices based on the Fitabase data. Although Bellabeat offers the variety of products listed above, we will focus on the fitness smartphone application. The fitness smartphone application lies at the core of all the other products. It is the most important piece of the puzzle because it makes sense of the data collected through the other products, such as the fitness watch, tracker, and bottle. The fitness application is where users will spend most of their time, looking at their progress and trends. This is why all the recommendations based on the data will be for the fitness application.
As explained earlier, we have been given access to Fitbit data from 33 participants over a period of roughly one month.
There are 18 CSV files. Let’s load them into R.
First, we need to import all the libraries that we will use for this analysis.
library(tidyverse)
library(here)
library(skimr)
library(janitor)
library(DescTools)
library(kableExtra)
library(scales)
library(cowplot)
library(lubridate) # date/time helpers used below: wday(), week(), hour(), as_datetime(), floor_date()
getwd()
## [1] "/Users/apple/Documents/R/CS_2_Data_4.12.16-5.12.16"
setwd("/Users/apple/Documents/R/CS_2_Data_4.12.16-5.12.16")
daily_activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| Name | daily_activity |
|---|---|
| Number of rows | 940 |
| Number of columns | 15 |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| TotalSteps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| TotalDistance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| TrackerDistance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| VeryActiveDistance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| ModeratelyActiveDistance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| LightActiveDistance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| VeryActiveMinutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| FairlyActiveMinutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| LightlyActiveMinutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| SedentaryMinutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## [1] "Minimum activity date is 4/12/2016"
## [1] "Maximum activity date is 5/9/2016"
We can see that the ActivityDate variable is of type character. It should be a date, so we need to convert it.
daily_activity <- daily_activity %>%
mutate(ActivityDate=as.Date(ActivityDate, format = "%m/%d/%Y"))
Making day and week variables
daily_activity <- daily_activity %>%
mutate(WeekDay=wday(ActivityDate,label=TRUE)) %>%
mutate(Week=week(ActivityDate))
# Week numbers are year-based (weeks 15-19 here). I want them to start from 1, so we create a new variable for that.
daily_activity <- daily_activity %>%
mutate(Week_ord=case_when(
Week == 15 ~ 1,
Week == 16 ~ 2,
Week == 17 ~ 3,
Week == 18 ~ 4,
Week == 19 ~ 5
))
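The hard-coded mapping above works for this particular date range. A more general sketch (equivalent for this data, and my own variation rather than part of the original analysis) re-indexes relative to the earliest week in the data:

# Alternative sketch: re-index weeks relative to the earliest week in the data,
# so the mapping does not need to be rewritten for a different date range
daily_activity <- daily_activity %>%
  mutate(Week_ord = Week - min(Week) + 1)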
This data will be much more useful once we aggregate it by date, taking the mean of each measure across participants.
aggregate_df <- daily_activity %>%
group_by(ActivityDate) %>%
summarize(across( c(TotalSteps,TotalDistance,TrackerDistance,
LoggedActivitiesDistance,VeryActiveDistance,
ModeratelyActiveDistance,LightActiveDistance,
SedentaryActiveDistance,VeryActiveMinutes,
FairlyActiveMinutes,SedentaryMinutes,
Calories),
~ round(mean(.),2),
.names="mean_{.col}"))
One of the ways Fitbit users make use of their devices is by comparing their daily progress over time. For example, a user can check their steps over the month.
user <- daily_activity %>%
filter(Id==sample(Id,1))
Here, we have randomly selected a user from our data. Now let’s compare their steps over time.
ggplot(user, aes(x=ActivityDate,y=TotalSteps)) +
geom_col(fill="goldenrod3") +
theme_minimal() +
scale_y_continuous(labels = comma)
From this graph, a user can easily see their progress over time and make adjustments to their fitness routine.
Another way to check progress is to compare your performance with the rest of the population. A user can compare their progress to other people who use the same fitness tracker and have a similar age, gender, and weight.
In our data, we do not have that luxury. Since we don't have information on age, gender, or weight, we will assume that all users in our data share the same profile.
Let's compare one user's data to that of the population. The population in this case consists of all the other participants in the data set. We will take the mean of the population and compare it with the user. We will select the user and the date at random.
# Let's make a population and sample data frame so we can make comparison graphs
random_Id <- sample(daily_activity$Id,1)
random_date <- sample(daily_activity$ActivityDate,1)
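# Note (my addition, not part of the original analysis): the randomly chosen Id may have
# no row for the randomly chosen date, which would leave the sample frame below empty.
# A defensive sketch is to resample until a combination with data exists:
while (nrow(filter(daily_activity, Id == random_Id, ActivityDate == random_date)) == 0) {
  random_Id   <- sample(daily_activity$Id, 1)
  random_date <- sample(daily_activity$ActivityDate, 1)
}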
# Make an aggregate data frame for the population
aggregate_pop <- daily_activity %>%
filter(Id!=random_Id & ActivityDate==random_date) %>%
summarize(across( c(TotalSteps,TotalDistance,TrackerDistance,
LoggedActivitiesDistance,VeryActiveDistance,
ModeratelyActiveDistance,LightActiveDistance,
SedentaryActiveDistance,VeryActiveMinutes,
FairlyActiveMinutes,SedentaryMinutes,
Calories),
~ round(mean(.),2),
.names="{.col}")) %>%
mutate(type="population") %>%
select(type,TotalSteps,TotalDistance,TrackerDistance,
LoggedActivitiesDistance,VeryActiveDistance,
ModeratelyActiveDistance,LightActiveDistance,
SedentaryActiveDistance,VeryActiveMinutes,
FairlyActiveMinutes,SedentaryMinutes,
Calories)
# Make an aggregate data frame for sample
aggregate_sample <- daily_activity %>%
filter(Id==random_Id & ActivityDate==random_date) %>%
select(-Id,-ActivityDate) %>%
mutate(type="sample") %>%
select(type,TotalSteps,TotalDistance,TrackerDistance,
LoggedActivitiesDistance,VeryActiveDistance,
ModeratelyActiveDistance,LightActiveDistance,
SedentaryActiveDistance,VeryActiveMinutes,
FairlyActiveMinutes,SedentaryMinutes,
Calories)
# Now we need to combine the two data frames into one
agg_sam_pop <- rbind(aggregate_pop,aggregate_sample)
Now we are ready to compare the user’s data to population’s average.
ggplot(agg_sam_pop,aes(x=type,y=TotalSteps,fill=type)) +
geom_col() +
theme_minimal() +
theme(panel.grid = element_blank()) +
geom_text(aes(label=paste0(comma(TotalSteps)," steps")), vjust=-0.5) +
labs(x="Type",y="Steps",
title = "Comparing a random user's steps to population's \nmean steps") +
scale_fill_manual(values = c("population"="darkgray","sample"="brown2")) +
scale_y_continuous(labels = comma) +
theme(legend.position = "none")
From this graph, the user can easily see how they are performing compared to other people. This kind of comparison can motivate the user and give them confidence about their fitness. Another improvement would be to add recommended target values; deciding those values properly would require domain knowledge that is outside the scope of this analysis.
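Purely as an illustration of that idea (the 10,000-step figure below is a commonly quoted default goal that I am assuming for the sketch, not a medically determined recommendation), a reference line could be added to the comparison:

# Sketch: add a generic 10,000-step reference line to the steps comparison
ggplot(agg_sam_pop, aes(x=type, y=TotalSteps, fill=type)) +
  geom_col() +
  geom_hline(yintercept = 10000, linetype = "dashed", color = "darkgreen") +
  annotate("text", x = 1.5, y = 10400, label = "10,000-step goal", size = 3, color = "darkgreen") +
  theme_minimal() +
  theme(panel.grid = element_blank(), legend.position = "none") +
  scale_fill_manual(values = c("population"="darkgray","sample"="brown2")) +
  scale_y_continuous(labels = comma) +
  labs(x="Type", y="Steps", title = "Steps compared with a generic daily goal")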
Similarly, we can use this method for other variables as well.
ggplot(agg_sam_pop,aes(x=type,y=TotalDistance,fill=type)) +
geom_col() +
theme_minimal() +
theme(panel.grid = element_blank()) +
geom_text(aes(label=paste0(comma(TotalDistance)," Miles")), vjust=-0.5) +
labs(x="Category",y="Distance",
title = "Comparing a random user's distance travelled \nto population's average distance") +
scale_fill_manual(values = c("population"="darkgray","sample"="brown2")) +
scale_y_continuous(labels = comma) +
theme(legend.position = "none")
ggplot(agg_sam_pop,aes(x=type,y=Calories,fill=type)) +
geom_col() +
theme_minimal() +
theme(panel.grid = element_blank()) +
geom_text(aes(label=paste0(comma(Calories)," calories")), vjust=-0.5) +
labs(x="Type",y="Calories Burned",
title = "Comparing a random user's calories burned \nto population's mean ") +
scale_fill_manual(values = c("population"="darkgray","sample"="brown2")) +
scale_y_continuous(labels = comma) +
theme(legend.position = "none")
We can do the same comparison for different activity minutes.
minute_activity <- agg_sam_pop %>%
select(type,VeryActiveMinutes,FairlyActiveMinutes,SedentaryMinutes) %>%
pivot_longer(
cols = c(VeryActiveMinutes,FairlyActiveMinutes,SedentaryMinutes),
values_to = "Minutes",
names_to = "Activity"
)
ggplot(minute_activity, aes(x=type, y=Minutes, fill=Activity)) +
geom_col(width=0.6) +
theme_minimal() +
labs(x="Type",y="Active Minutes",
title = "Comparing a random user's calories burned \nto population's mean ") +
scale_fill_manual(values = c("FairlyActiveMinutes"="mediumpurple3",
"SedentaryMinutes"="lightblue2",
"VeryActiveMinutes"="dodgerblue3")) +
scale_y_continuous(labels = comma,breaks = seq(0,1300, by=200))
With this graph, the user can compare their activity types with the population to see how they are performing.
In the same way, a user can check their weekly activity patterns.
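For example (a sketch using the WeekDay variable created earlier, not a figure from the original write-up), average steps by day of the week show which days tend to be the most active:

# Sketch: mean steps by day of the week across all participants
weekday_steps <- daily_activity %>%
  group_by(WeekDay) %>%
  summarize(mean_steps = round(mean(TotalSteps)))

ggplot(weekday_steps, aes(x=WeekDay, y=mean_steps)) +
  geom_col(fill="goldenrod3") +
  theme_minimal() +
  scale_y_continuous(labels = comma) +
  labs(x="Day of the week", y="Mean steps", title="Mean steps by weekday")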
Moving forward, we will work with the heart rate data frame.
heart_rate <- read_csv("heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The heart rate CSV contains heart-rate readings at the second level, recorded roughly every five seconds, resulting in a staggering 2.4 million observations in this data frame.
The Time variable is also a character in the heart rate data frame. We will convert it into a datetime variable.
heart_rate <- heart_rate %>%
mutate(Time=as_datetime(Time, format = "%m/%d/%Y %I:%M:%S %p")) %>%
mutate(hour=hour(Time))
We will need to aggregate the heart rate data to get some observations out of it. First, let’s check out the average heart rate by hour.
hr_sum <- heart_rate %>%
group_by(hour) %>%
summarise(mean_hr=round(mean(Value),2))
ggplot(hr_sum,aes(x=hour,y=mean_hr)) +
geom_smooth(method="gam",formula = y ~ s(x),se=FALSE) +
theme_minimal() +
geom_line(alpha=0.6) +
scale_x_continuous(breaks=seq(0, 23, by = 1)) +
scale_y_continuous(breaks=seq(60, 100, by =5 )) +
labs(y="Mean Heart rate",x="Hours (24 hr time)",title="Mean Heart rate by Hour") +
annotate("rect",xmin = 16,xmax = 19,ymin = 60,ymax = 90,fill="red",alpha=0.3)
From this graph, we can see that heart rate is low at night during sleep hours, gradually increases throughout the day, peaks in the late afternoon (the highlighted region, 4:00 pm to 7:00 pm), and goes down again in the evening.
Now let’s move on to how a user would benefit from this information. A user can compare their heart rate information with other users.
unique(heart_rate$Id)
## [1] 2022484408 2026352035 2347167796 4020332650 4388161847 4558609924
## [7] 5553957443 5577150313 6117666160 6775888955 6962181067 7007744171
## [13] 8792009665 8877689391
However, we cannot use the same random user as before, because the heart rate data does not include all 33 original participants; it only has data for 14 of them. This could be because not all of the participants' devices are capable of recording heart rate.
Because of this, we will need to create a new random user for this analysis.
hr_user_id <- sample(heart_rate$Id,1)
# Creating a data frame for user's data
hr_user <- heart_rate %>%
filter(Id==hr_user_id) %>%
group_by(hour) %>%
summarize(mean_hr=mean(Value)) %>%
mutate(category="user") %>%
select(category,hour,mean_hr)
# Creating a data frame for the population data
hr_pop <- heart_rate %>%
filter(Id!=hr_user_id) %>%
group_by(hour) %>%
summarize(mean_hr=mean(Value)) %>%
mutate(category="population") %>%
select(category,hour,mean_hr)
# Combining both data frames
hr_comp <- rbind(hr_user,hr_pop)
hr_comp <- hr_comp %>%
arrange(-mean_hr)
# we don't need to convert this to long format
ggplot(hr_comp, aes(x=hour,y=mean_hr,color=category)) +
geom_smooth(method="loess",formula = y~x,se=FALSE) +
scale_x_continuous(breaks = seq(0,23 ,by=2)) +
scale_color_manual(values = c("population"="blue4","user"="red1")) +
labs(x="Hours",y="Mean Heart Rate",title = "Comparing the mean heart rate of a user and population") +
theme_minimal()
From this graph, a user can decipher whether their heart rate is lower or higher than the average of the population over a month. We could also provide a recommended range of heart rate values for their profile based on medical research.
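As a sketch of that idea, a reference band could be shaded on the hourly comparison. The 60-100 bpm band used below is only the commonly quoted resting heart-rate range, assumed here as a placeholder; a real recommendation would need to account for age, activity level, and whether the reading is a true resting measurement.

# Sketch: shade a generic 60-100 bpm reference band on the hourly comparison
ggplot(hr_comp, aes(x=hour, y=mean_hr, color=category)) +
  annotate("rect", xmin = 0, xmax = 23, ymin = 60, ymax = 100, fill = "palegreen", alpha = 0.2) +
  geom_smooth(method="loess", formula = y~x, se=FALSE) +
  scale_x_continuous(breaks = seq(0, 23, by=2)) +
  scale_color_manual(values = c("population"="blue4", "user"="red1")) +
  labs(x="Hours", y="Mean Heart Rate",
       title = "Mean heart rate with a generic reference band") +
  theme_minimal()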
Next, we will import the calories data and see if we can find any relations with the heart rate data.
calories_min <- read_csv("minuteCaloriesNarrow_merged.csv")
## Rows: 1325580 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityMinute
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We want to merge the calories data frame with the heart rate data frame. Note that the calories data frame contains one value per minute, while the heart rate data frame contains readings roughly every five seconds. If we merge them, there will be a lot of NA values, which we will drop.
First, we need to clean the Calories data so that we can merge it.
# Before merging lets correct the time variable
calories_min <- calories_min %>%
rename(ActivityHour=ActivityMinute)
calories_min <- calories_min %>%
mutate(ActivityHour=as_datetime(ActivityHour, format = "%m/%d/%Y %I:%M:%S %p"))
heart_rate2 <- heart_rate %>%
rename(HeartRate=Value) %>%
rename(ActivityHour=Time) %>%
select(-hour)
# Merging the data
merged_hr_cal_min <- full_join(heart_rate2,calories_min, by = c("Id","ActivityHour"))
merged_hr_cal_min <- merged_hr_cal_min %>%
na.omit()
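Dropping the non-matching rows means that only heart-rate readings falling exactly on a minute boundary survive the join. An alternative sketch (my own variation, not the approach used in this analysis) is to average heart rate within each minute before joining, which preserves far more of the heart-rate data:

# Sketch: aggregate heart rate to the minute before joining with calories
hr_by_minute <- heart_rate2 %>%
  mutate(ActivityHour = floor_date(ActivityHour, unit = "minute")) %>%
  group_by(Id, ActivityHour) %>%
  summarize(HeartRate = mean(HeartRate), .groups = "drop")

merged_hr_cal_alt <- inner_join(hr_by_minute, calories_min, by = c("Id", "ActivityHour"))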
#Check for the correlation
cor(merged_hr_cal_min$HeartRate,merged_hr_cal_min$Calories)
## [1] 0.7151299
ggplot(merged_hr_cal_min,aes(x=HeartRate,y=Calories)) +
geom_jitter(colour="red",alpha=0.2) +
geom_smooth(method = "lm") +
theme_minimal() +
labs(x="Heart Rate",y="Calories Burned",
title = "Correlation between Heart Rate and Calories burned",
subtitle = "By minutes") +
annotate("text", x=75, y=15, label="A correlation of 0.71 exists", size=3)
## `geom_smooth()` using formula = 'y ~ x'
From this graph, we can see that there is a positive correlation between heart rate and calories burned, which is consistent with medical research. Users can also check their own correlation between heart rate and calories burned, as every person burns calories at a different rate.
# Let's check for the user.
user_cal_hr <- sample(merged_hr_cal_min$Id,1)
merged_hr_cal_user <- merged_hr_cal_min %>%
filter(Id==user_cal_hr)
cor(merged_hr_cal_user$HeartRate,merged_hr_cal_user$Calories)
## [1] 0.9040238
ggplot(merged_hr_cal_user,aes(x=HeartRate,y=Calories)) +
geom_jitter(alpha=0.2,colour="red") +
geom_smooth(method = "lm",se=FALSE) +
theme_minimal() +
labs(x="Heart Rate", y="Calories Burned",title = " User's Correlation between Heart rate and calories burned")
## `geom_smooth()` using formula = 'y ~ x'
A user can check their personalized correlation between Heart Rate and Calories burned through this graph and adjust their activities accordingly.
From this data, we can also check how heart rate and calories burned change over time. Because the data is large, we cannot plot every day; the graph would become too convoluted. Averaging over the whole period would also hide the detail we are interested in.
Instead, we will look at the first day of the study, when the participants started wearing these smart devices, and the last day of the study, to see how their habits changed over the course of the study.
# Extract the dates
merged_hr_cal_min <- merged_hr_cal_min %>%
mutate(date=as_date(ActivityHour)) %>%
mutate(hour=hour(ActivityHour)) %>%
mutate(date_hour = format(ActivityHour, "%m/%d/%Y %H")) %>%
mutate(week=week(ActivityHour))
facet <- merged_hr_cal_min %>%
filter(date==min(date) | date==as.Date("2016-05-11"))
# There is a scaling issue between heart rate and calories burned: heart rate values tend to fall between 50 and 150, while per-minute calorie values are much smaller.
# To fix that, we will rescale both variables to percentages (min-max normalization).
# Fixing the scaling issue
facet <- facet %>%
mutate(per_HeartRate = ((HeartRate - min(HeartRate)) / (max(HeartRate) - min(HeartRate))) * 100,
per_Calories = ((Calories - min(Calories)) / (max(Calories) - min(Calories))) * 100)
# Let's change the data into long format
facet <- pivot_longer(
data = facet,
cols = c(per_HeartRate,per_Calories),
names_to = "Variable",
values_to = "Percentage"
)
# Now let's graph it
ggplot(facet,aes(x=hour,y=Percentage,fill=Variable)) +
geom_area(position = "stack") +
labs(title = "Heart Rate and Calories Over Hours on the first and last day",
subtitle = "2016-05-12",
x = "Hour", y = "Percentage") +
scale_fill_manual(values = c("per_HeartRate" = "blue", "per_Calories" = "red")) +
scale_x_continuous(breaks = seq(0,23, by=2)) +
theme_minimal() +
theme(legend.text = element_text(size=8)) +
theme(legend.title = element_text(size = 10)) +
theme(legend.position = "top") +
theme(axis.text.x = element_text(size = 8)) +
facet_wrap(vars(date))
From this graph, we can see how users have reacted to the data that they have been receiving over the month. We can see that there is less activity at night on the last day compared to the first day.
Importing the sleep CSV
sleep <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Correcting the date variable
sleep <- sleep %>%
mutate(SleepDay=as_datetime(SleepDay,format = "%m/%d/%Y %I:%M:%S %p"))
sleep %>%
count(Id)
## # A tibble: 24 × 2
## Id n
## <dbl> <int>
## 1 1503960366 25
## 2 1644430081 4
## 3 1844505072 3
## 4 1927972279 5
## 5 2026352035 28
## 6 2320127002 1
## 7 2347167796 15
## 8 3977333714 28
## 9 4020332650 8
## 10 4319703577 26
## # ℹ 14 more rows
# There are 24 unique id's in the sleep data set
sleep_ids <- sleep %>%
group_by(SleepDay) %>%
summarize(UniqueIds = n_distinct(Id))
print(sleep_ids)
## # A tibble: 31 × 2
## SleepDay UniqueIds
## <dttm> <int>
## 1 2016-04-12 00:00:00 13
## 2 2016-04-13 00:00:00 14
## 3 2016-04-14 00:00:00 13
## 4 2016-04-15 00:00:00 17
## 5 2016-04-16 00:00:00 14
## 6 2016-04-17 00:00:00 12
## 7 2016-04-18 00:00:00 10
## 8 2016-04-19 00:00:00 14
## 9 2016-04-20 00:00:00 15
## 10 2016-04-21 00:00:00 15
## # ℹ 21 more rows
# Even with 24 unique IDs, only about 12-15 of them appear on any given day.
head(sleep)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <dttm> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 00:00:00 1 327 346
## 2 1.50e9 2016-04-13 00:00:00 2 384 407
## 3 1.50e9 2016-04-15 00:00:00 1 412 442
## 4 1.50e9 2016-04-16 00:00:00 2 340 367
## 5 1.50e9 2016-04-17 00:00:00 1 700 712
## 6 1.50e9 2016-04-19 00:00:00 1 304 320
colnames(sleep)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
From this data, we will check how many hours users spend in bed and how many of those hours are spent asleep.
sleep_s <- sleep %>%
group_by(SleepDay) %>%
summarize(sleep=round(mean(TotalMinutesAsleep/60),1),
in_bed=round(mean(TotalTimeInBed/60),1)
)
# Convert it to long format
sleep_s <- pivot_longer(
data = sleep_s,
cols = c(sleep,in_bed),
values_to = "Value",
names_to = "category"
)
sleep_s <- sleep_s %>%
arrange(category)
ggplot(sleep_s, aes(x=SleepDay,y=Value,fill=category)) +
geom_col(position = "identity") +
labs(title = "Average Sleep and time in bed over time",
subtitle = "From 12th April to 12th March 2016",
x = "Date", y = "Hours") +
scale_fill_manual(values = c("in_bed" = "darkslateblue", "sleep" = "deeppink3")) +
geom_text(aes(label=Value),vjust=1.5,size=2, color="white",fontface="bold") +
theme_minimal()
We can also look in isolation at how much time users spend in bed but not asleep. From this data, we can find weekly patterns.
not_asleep <- sleep %>%
group_by(SleepDay) %>%
summarize(time_not_asleep=mean(TotalTimeInBed-TotalMinutesAsleep)) %>%
mutate(weekday=wday(SleepDay,label=TRUE))
mean(not_asleep$time_not_asleep)
## [1] 38.73216
ggplot(not_asleep,aes(x=SleepDay,y=time_not_asleep,fill=weekday)) +
geom_col() +
theme_minimal() +
labs(y="time (in minutes)",x="Sleep Day",
title = "Time users spent in the bed while awake",
subtitle = "An average of 38 minutes were spent awake on bed") +
geom_text(aes(label=round(time_not_asleep)),vjust=-0.5,size=3)
We can see that users tend to spend more time in bed awake on the weekends. This information will be useful for the users if they want to change their habits and want to go to bed early.
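To check that weekend pattern more directly (a quick summary I am adding here, rather than a figure from the original analysis), the awake-in-bed time can be averaged by weekday:

# Sketch: mean awake-in-bed minutes by day of the week
not_asleep %>%
  group_by(weekday) %>%
  summarize(mean_minutes = round(mean(time_not_asleep), 1)) %>%
  arrange(desc(mean_minutes))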
Users can also compare their sleep data to that of the population.
# Let's now pick a random date and a random Id and then we will compare that Id's data to the population average.
#Picking a random date
random_date_sleep <- sample(sleep$SleepDay,1)
print(random_date_sleep)
## [1] "2016-05-03 UTC"
# There is a problem with picking a random date and a random user here:
# for some users, sleep data is not available for every date,
# so if we pick a user whose data does not exist for the chosen date, the sample bars will not show up.
# Because of this, we will simply pick a user and a date for which that user's data exists.
sleep_pop <- sleep %>%
filter(SleepDay==as.Date("2016-04-13 UTC")) %>%
filter(Id!=5553957443) %>%
summarize(
TimeAsleep=round(mean(TotalMinutesAsleep/60),1),
TimeInBed=round(mean(TotalTimeInBed/60),1)
) %>%
mutate(type="population")
# Do the same for the sample
sleep_sam <- sleep %>%
filter(SleepDay==as.Date("2016-04-13 UTC")) %>%
filter(Id==5553957443) %>%
summarize(
TimeAsleep=round(TotalMinutesAsleep/60,1),
TimeInBed=round(TotalTimeInBed/60,1)
) %>%
mutate(type="sample")
# Make a new data frame combining this information
com_sleep <- bind_rows(sleep_pop,sleep_sam)
# Arranging the columns
com_sleep <- com_sleep %>%
select(type,TimeAsleep,TimeInBed) %>%
print()
## # A tibble: 2 × 3
## type TimeAsleep TimeInBed
## <chr> <dbl> <dbl>
## 1 population 7.1 7.8
## 2 sample 7.6 8.1
com_sleep <- pivot_longer(
data = com_sleep,
cols = c(TimeAsleep,TimeInBed),
values_to = "hours",
names_to = "Activity_type")
com_sleep <- com_sleep %>%
arrange(-hours)
# Finally, Let's graph it.
ggplot(com_sleep,aes(x=type,y=hours,fill=Activity_type)) +
geom_col(position = "identity",width = 0.8) +
scale_fill_manual(values = c(TimeAsleep="aquamarine4",TimeInBed="brown2")) +
geom_text(aes(label=hours),vjust=1.5,color="white",fontface="bold") +
theme_minimal() +
theme(panel.grid = element_blank()) +
theme(axis.text = element_text(size=10)) +
labs(x="Type",y="Hours",title = "Comparing a random user to the mean of population")
Based on the observations of this case study, I present the following recommendations.
Smart-device users rely on the smartphone application to observe their progress, so it is important to present that information visually in the application.
Users can be shown graphs similar to the ones in this analysis inside the application so they can more closely assess their physical health.
Users should be shown their performance in comparison to other users with similar profiles (age, sex, etc.) so that they have a clear picture of their health.
In addition to the graphs shown here, the application should also integrate recommendations, based on medical research, about what values are considered healthy, whether for heart rate, calories burned, sleep, or weight.