Bellabeat Case Study

ASK:

In this case study, a company, Bellabeat, asks us to track and identify trends in how its customers use their devices. Bellabeat is a fitness device company with three devices: Leaf, Time, and Spring. It also offers a subscription-based membership that gives customers fully guided support on nutrition, activity, sleep, health, and beauty, based on their lifestyle and goals.

Goals:

Bellabeat's stakeholders have asked three major questions:

What are some visible trends in their smart device usage?
How could these trends apply to Bellabeat Customers?
And how could these trends help influence Bellabeat’s future marketing strategy?

Important People:

Urska Srsen (Cofounder and Chief Creative Officer of Bellabeat)
Sando Mur (Cofounder and Mathematician of Bellabeat)
Bellabeat’s Marketing Team

PREPARE:

The data comes from Kaggle and consists of 18 CSV files. Some of the files contain the same information twice, saved once in long format and once in wide format. The dataset is licensed as CC0: Public Domain.

About the Data:

Although the Kaggle page did not contain a description of the dataset, a data dictionary was found from an outside resource for the dataset (Fitabase Fitbit Data Dictionary as of 2:14:24).

The dataset contains personal fitness information that was tracked by Fitbit devices or entered manually by verified users. The dataset is said to cover 30 verified users, whose age, name, and sex are all withheld. All thirty users consented to the submission of their personal tracker data, which includes heart rate, active minutes, activity intensity, sleep measurements, steps, weight, and MET. Each user is identified by a unique numeric ID.

Limitations:

Because of the small sample size of 30 users and the omission of some of their personal information, we cannot claim that the upcoming analysis is 100% reliable. For that, we would need a larger pool of data to analyze.

PROCESS:

First, I loaded the necessary libraries and then imported the CSV files I wanted into R.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(tidyr)
library(readr)
library(lubridate)

dailyActivity <- read_csv("C:/Users/yosup/Downloads/archive (3)/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

weightLog <- read_csv("C:/Users/yosup/Downloads/archive (3)/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sleepDay <- read_csv("C:/Users/yosup/Downloads/archive (3)/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

heartrate_seconds <- read_csv("C:/Users/yosup/Downloads/archive (3)/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")

## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

“dailyActivity” is the dataset that contains the information from all the other datasets except the data on sleep and heart rate. So we use “dailyActivity” to check the total number of users in the dataset.

n_distinct(dailyActivity$Id)

## [1] 33

We can see that there are 33 distinct users in the dataset, whereas the data dictionary states that there are only 30. This could be due to errors by the person gathering the data, or some users might have more than one ID.
We will start with the weightLog dataset. We want to see how many users are in this dataset.

n_distinct(weightLog$Id)

## [1] 8

We see that there are 8 distinct users, so we immediately know that not all users used their fitness device to track their weight.
We also want to see how many of the users' weight entries were entered manually.

length(which(weightLog$IsManualReport==TRUE))

## [1] 41

length(weightLog$IsManualReport)

## [1] 67

Out of the 67 entries in the weightLog dataset, 41 were entered manually by the user.
I noticed that there were a lot of NA values in the Fat column of the weightLog dataset, so we check how many there really are.

sum(is.na(weightLog$Fat))

## [1] 65

65 out of 67 entries in the Fat column were NA, so we can safely assume that the users did not use the smart devices to track their “Fat”.
To get an even better understanding of the data, I want to see the time span of the log dates. I took the range of the SleepDay column and saved it into “date_range_sleepDay”. The printed values are 4/12/2016 at 12 AM and 5/9/2016 at 12 AM. Note that SleepDay was read as text, so range() compares strings rather than dates: “5/9/2016” sorts after any “5/1x/2016” value, and the latest date may therefore be later than 5/9 (the data folder is named 4.12.16-5.12.16). Either way, the data spans roughly a month.

date_range_sleepDay <- range(sleepDay$SleepDay)
date_range_sleepDay

## [1] "4/12/2016 12:00:00 AM" "5/9/2016 12:00:00 AM"
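Because SleepDay is read as text, one way to get a chronological range is to parse it first. The sketch below uses made-up timestamps in the same format (they are not rows taken from the dataset) together with lubridate's mdy_hms(). It only illustrates how the text comparison and the date comparison can disagree.

```r
library(lubridate)

# Hypothetical timestamps in the same "m/d/Y h:m:s AM/PM" format as SleepDay
stamps <- c("4/12/2016 12:00:00 AM", "5/9/2016 12:00:00 AM", "5/12/2016 12:00:00 AM")

# Character comparison: "5/9/..." sorts after "5/12/..." because "9" > "1"
text_range <- range(stamps)

# Parse to date-times first, then take the chronological range
parsed_range <- range(mdy_hms(stamps))

text_range
parsed_range
```

Applied to the real data, the same idea would be range(mdy_hms(sleepDay$SleepDay)), assuming every value follows this format.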

I also noticed that some users may have more entries than others in the dataset. To check this, I counted the number of occurrences of each user ID and then calculated the standard deviation of those counts.

sdEntries <- sleepDay %>% count(Id)
sd(sdEntries$n)

## [1] 11.49661

We get a standard deviation of 11.49661 entries per user, which is large for a period of about a month. This suggests that the entries are not spread evenly across users and that the dataset is dominated by a few of them.
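A single standard deviation does not show who contributes many or few entries. As a minimal sketch, assuming a made-up sleep log with three hypothetical user IDs, the per-user counts can be inspected directly:

```r
# Hypothetical sleep log: one row per recorded night, IDs are made up
toy_sleep <- data.frame(Id = c(rep(101, 28), rep(102, 3), rep(103, 15)))

# Number of entries per user (base R counterpart of count(Id))
entries_per_user <- table(toy_sleep$Id)

# Minimum, quartiles and maximum show how uneven the contributions are
summary(as.vector(entries_per_user))
sd(as.vector(entries_per_user))
```

On the real data, summary(sdEntries$n) would give the same kind of overview.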
Now we look at the TotalMinutesAsleep and TotalTimeInBed columns of the dataset. To begin, I want to calculate the mean of both.

mean(sleepDay$TotalMinutesAsleep)

## [1] 419.4673

hist(sleepDay$TotalMinutesAsleep)

mean(sleepDay$TotalTimeInBed)

## [1] 458.6392

hist(sleepDay$TotalTimeInBed)

On average, the users slept around 419 minutes and spent around 459 minutes in bed (sleeping time included). Therefore, on average, around 40 minutes were spent awake in bed. The histograms are only there to help me visualize the data.
To make the sleepDay dataset easier to work with, I created a new column, “TimeDiff”, which holds the difference between the time spent in bed and the time actually spent sleeping.

sleepDay <- transform(sleepDay, TimeDiff = abs(sleepDay$TotalMinutesAsleep - sleepDay$TotalTimeInBed))

I will use this new column later on in my analysis.
Now taking a look at the heartrate dataset, we begin by finding the max/min of the Time and Value columns. I also want to see the mean and the spread of the Value column.

range(heartrate_seconds$Time)

## [1] "4/12/2016 1:00:00 AM" "5/9/2016 9:59:59 PM"

hist(heartrate_seconds$Value)

summary(heartrate_seconds$Value)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.00   63.00   73.00   77.33   88.00  203.00

sd(heartrate_seconds$Value)

## [1] 19.4045

As with the sleepDay dataset, the printed range runs from 4/12/16 to 5/9/16, the only difference being that the times shown are 1:00:00 AM and 9:59:59 PM instead of 12 AM. Because Time is stored as text, range() compares the strings character by character, so these values are not necessarily the true earliest and latest timestamps. We also see that the lowest heart rate recorded was 36 and the highest was 203.
A normal adult resting heart rate is 60-100 bpm, but 40-60 bpm can be normal during sleep, so the reading of 36 may come from a sleeping user or from a faulty device.
The average heart rate for a person in their 20s who is working out is 100-170 bpm, with an average maximum of 200. So we might assume that the user with the reading of 203 was working out or having an emergency (or that this is another device error).
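To look at such readings more systematically, they can be labelled instead of judged one by one. The sketch below uses made-up readings, and the cutoffs of 40 and 200 bpm are simply the rough bounds quoted above, not clinical thresholds.

```r
# Hypothetical readings; the cutoffs are assumptions taken from the ranges quoted above
toy_hr <- data.frame(Id = c(1, 1, 2, 2, 3), Value = c(36, 72, 88, 203, 150))
low_cutoff <- 40
high_cutoff <- 200

# Label each reading so unusual values can be inspected rather than silently kept
toy_hr$Flag <- ifelse(toy_hr$Value < low_cutoff, "low",
                      ifelse(toy_hr$Value > high_cutoff, "high", "typical"))

table(toy_hr$Flag)
```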
Now I want to create a simpler version of the heartrate_seconds because of its large size.

max_min_heart8 <- heartrate_seconds %>% group_by(Id) %>% summarize(across(everything(), max))
max_min_heart8 <- max_min_heart8 %>% rename(maxTime = Time, maxValue = Value)
temp <- heartrate_seconds %>% group_by(Id) %>% summarize(across(everything(), min))

max_min_heart8 <- cbind(max_min_heart8, temp$Time, temp$Value)
max_min_heart8 <- max_min_heart8 %>% rename(minTime = `temp$Time`, minValue = `temp$Value`)
max_min_heart8 <- max_min_heart8 %>% relocate(minTime,  .before = maxValue)
 

max_min_heart8 <- cbind(max_min_heart8, heartrate_seconds %>% count(Id))
max_min_heart8 <- max_min_heart8 %>% select(-6)
max_min_heart8 <- max_min_heart8 %>% rename(Count = n)
max_min_heart8

##            Id              maxTime               minTime maxValue minValue
## 1  2022484408  5/9/2016 9:59:55 AM  4/12/2016 1:00:00 PM      203       38
## 2  2026352035  5/9/2016 7:49:45 PM  4/17/2016 5:30:20 AM      125       63
## 3  2347167796 4/29/2016 6:56:50 AM  4/12/2016 1:00:10 PM      195       49
## 4  4020332650  5/9/2016 9:59:59 PM  4/12/2016 1:00:00 AM      191       46
## 5  4388161847  5/9/2016 9:59:55 PM  4/13/2016 1:00:00 AM      180       39
## 6  4558609924  5/9/2016 9:59:55 AM  4/12/2016 1:00:00 PM      199       44
## 7  5553957443  5/9/2016 9:59:55 PM  4/12/2016 1:00:05 AM      165       47
## 8  5577150313  5/9/2016 9:59:50 PM  4/12/2016 1:00:00 AM      174       36
## 9  6117666160  5/9/2016 9:59:45 AM  4/15/2016 1:00:00 PM      189       52
## 10 6775888955  5/7/2016 9:59:55 AM 4/13/2016 10:00:00 PM      177       55
## 11 6962181067  5/9/2016 9:59:55 AM  4/12/2016 1:00:00 AM      184       47
## 12 7007744171  5/6/2016 9:59:50 AM  4/12/2016 1:00:00 PM      166       54
## 13 8792009665  5/4/2016 9:59:45 AM  4/12/2016 1:00:00 PM      158       43
## 14 8877689391  5/9/2016 9:59:56 PM  4/12/2016 1:00:05 PM      180       46
##     Count
## 1  154104
## 2    2490
## 3  152683
## 4  285461
## 5  249748
## 6  192168
## 7  255174
## 8  248560
## 9  158899
## 10  32771
## 11 266326
## 12 133592
## 13 122841
## 14 228841

max_min_heart8 contains a summary for each unique user ID in the heartrate_seconds dataset. It shows the maximum and minimum of both the heart rate value and the Time column, along with the number of rows recorded for each ID. Because Time is a character column, its maximum and minimum are string comparisons and not necessarily the true latest and earliest timestamps. I will use this dataset later for analysis.
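The same kind of per-user summary can be written as one grouped summarize() call, with the timestamps parsed first so that the minimum and maximum are chronological. This is a sketch on made-up rows; the column names match heartrate_seconds, but the values are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows in the same shape as heartrate_seconds
toy_hr <- data.frame(
  Id    = c(1, 1, 1, 2, 2),
  Time  = c("4/12/2016 1:00:00 PM", "5/9/2016 9:59:55 AM", "5/10/2016 8:00:00 AM",
            "4/13/2016 1:00:00 AM", "4/13/2016 10:00:00 PM"),
  Value = c(70, 120, 65, 55, 90)
)

hr_summary <- toy_hr %>%
  mutate(Time = mdy_hms(Time)) %>%   # parse text into date-times
  group_by(Id) %>%
  summarize(
    minTime  = min(Time),
    maxTime  = max(Time),
    minValue = min(Value),
    maxValue = max(Value),
    Count    = n()                   # rows recorded for this Id
  )

hr_summary
```

Note that for Id 1 the latest timestamp is 5/10/2016, whereas a text comparison would have picked "5/9/2016 9:59:55 AM".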
The dailyActivity dataset is also big, so, just as I did for the heartrate_seconds dataset, I will create another dataset that summarizes it. This time, however, I will only find the maximum of each column for each ID for now.

maxmin_dailyActivity <- dailyActivity %>% group_by(Id) %>% summarize(across(everything(), max))
glimpse(dailyActivity)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

I saw that most of the rows in the LoggedActivitiesDistance column were zeros, so I wanted to see how many of the entries were actually logged.

#filtering out all the unlogged days of activity
active_days <- filter(dailyActivity, dailyActivity$LoggedActivitiesDistance > 0)
length(active_days)

## [1] 15

#only 4 unique users had LoggedActivityDistance > 0
n_distinct(active_days$Id)

## [1] 4

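The 15 printed above is not a count of entries: length() applied to a data frame returns the number of columns, and active_days keeps all 15 columns of dailyActivity. The number of manually logged entries would be given by nrow(active_days). What the output does show is that only 4 distinct users logged a distance manually.
This might mean a few things: only 4 of the 33 users use this feature willingly, the device had an error that forced the users to enter the distance themselves, or the users were not wearing the device during their activity and therefore had to enter the distance themselves.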
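The difference between counting columns and counting rows is easy to check on a small example. The data frame below is made up for illustration; only the column name LoggedActivitiesDistance is taken from dailyActivity.

```r
# Hypothetical activity log with three columns
toy_activity <- data.frame(
  Id = c(1, 1, 2, 3, 3),
  LoggedActivitiesDistance = c(0, 2.1, 0, 0, 4.9),
  TotalSteps = c(9000, 12000, 4000, 7000, 15000)
)

# Keep only the days with a manually logged distance
logged <- toy_activity[toy_activity$LoggedActivitiesDistance > 0, ]

n_columns <- length(logged)            # number of columns, not entries
n_entries <- nrow(logged)              # number of logged entries
n_users   <- length(unique(logged$Id)) # distinct users who logged

c(columns = n_columns, entries = n_entries, users = n_users)
```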
What I want to do now is create a new dataset that contains all (or most) of the information I need, so that I can work with one or two datasets instead of many. First, I created sleepTemp, which temporarily holds all the information I need from sleepDay while I merge it with dailyActivity.

sleepTemp <- sleepDay %>% separate(SleepDay, into = c("NewDate", "Hour", "AM_or_PM"), sep = " ")
View(sleepTemp)
merged_data <- left_join(dailyActivity, sleepTemp, by = c('Id' = 'Id','ActivityDate' = 'NewDate'), relationship = "many-to-many") 
merged_data <- merged_data %>% relocate(ActivityDate, .before = TotalSteps)
n_distinct(merged_data$Id)

## [1] 33

summary(merged_data)

##        Id            ActivityDate         TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Length:943         Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3795   1st Qu.: 2.620  
##  Median :4.445e+09   Mode  :character   Median : 7439   Median : 5.260  
##  Mean   :4.858e+09                      Mean   : 7652   Mean   : 5.503  
##  3rd Qu.:6.962e+09                      3rd Qu.:10734   3rd Qu.: 7.720  
##  Max.   :8.878e+09                      Max.   :36019   Max.   :28.030  
##                                                                         
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.000            Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.000            1st Qu.: 0.000    
##  Median : 5.260   Median :0.000            Median : 0.220    
##  Mean   : 5.489   Mean   :0.110            Mean   : 1.504    
##  3rd Qu.: 7.715   3rd Qu.:0.000            3rd Qu.: 2.065    
##  Max.   :28.030   Max.   :4.942            Max.   :21.920    
##                                                              
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.950      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.380      Median :0.000000       
##  Mean   :0.5709           Mean   : 3.349      Mean   :0.001601       
##  3rd Qu.:0.8050           3rd Qu.: 4.790      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##                                                                      
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0          Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127          1st Qu.: 729.0  
##  Median :  4.00    Median :  7.00      Median :199          Median :1057.0  
##  Mean   : 21.24    Mean   : 13.63      Mean   :193          Mean   : 990.4  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264          3rd Qu.:1229.0  
##  Max.   :210.00    Max.   :143.00      Max.   :518          Max.   :1440.0  
##                                                                             
##     Calories        Hour             AM_or_PM         TotalSleepRecords
##  Min.   :   0   Length:943         Length:943         Min.   :1.000    
##  1st Qu.:1830   Class :character   Class :character   1st Qu.:1.000    
##  Median :2140   Mode  :character   Mode  :character   Median :1.000    
##  Mean   :2308                                         Mean   :1.119    
##  3rd Qu.:2796                                         3rd Qu.:1.000    
##  Max.   :4900                                         Max.   :3.000    
##                                                       NA's   :530      
##  TotalMinutesAsleep TotalTimeInBed     TimeDiff     
##  Min.   : 58.0      Min.   : 61.0   Min.   :  0.00  
##  1st Qu.:361.0      1st Qu.:403.0   1st Qu.: 17.00  
##  Median :433.0      Median :463.0   Median : 25.00  
##  Mean   :419.5      Mean   :458.6   Mean   : 39.17  
##  3rd Qu.:490.0      3rd Qu.:526.0   3rd Qu.: 40.00  
##  Max.   :796.0      Max.   :961.0   Max.   :371.00  
##  NA's   :530        NA's   :530     NA's   :530
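The summary above reports 943 rows, three more than the 940 rows of dailyActivity, which may indicate repeated (Id, date) rows in sleepDay. A minimal sketch of how one could check for repeated keys and drop exact duplicates before joining, using made-up data with the same key columns:

```r
library(dplyr)

# Hypothetical activity and sleep tables; the last two sleep rows are exact duplicates
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 8000)
)
toy_sleep <- data.frame(
  Id = c(1, 1, 1),
  NewDate = c("4/12/2016", "4/13/2016", "4/13/2016"),
  TotalMinutesAsleep = c(400, 380, 380)
)

# Keys that appear more than once in the sleep table
dup_keys <- toy_sleep %>% count(Id, NewDate) %>% filter(n > 1)

# Joining as-is repeats the matching activity row
joined_raw <- left_join(toy_activity, toy_sleep,
                        by = c("Id" = "Id", "ActivityDate" = "NewDate"))

# Dropping exact duplicate rows first keeps one row per activity day
joined_clean <- left_join(toy_activity, distinct(toy_sleep),
                          by = c("Id" = "Id", "ActivityDate" = "NewDate"))

c(raw = nrow(joined_raw), clean = nrow(joined_clean))
```

Whether dropping duplicates is appropriate for the real sleepDay data depends on whether the repeated rows are truly identical, which is an assumption here.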

Now I want to organize and extend the merged_data dataset. Specific dates can be hard to interpret during analysis, so to make things easier I add a new column, Weekday, which shows the day of the week on which each entry was made.

merged_data <- merged_data %>% mutate(Weekday = weekdays(mdy(ActivityDate)))
merged_data

## # A tibble: 943 × 22
##            Id ActivityDate TotalSteps TotalDistance TrackerDistance
##         <dbl> <chr>             <dbl>         <dbl>           <dbl>
##  1 1503960366 4/12/2016         13162          8.5             8.5 
##  2 1503960366 4/13/2016         10735          6.97            6.97
##  3 1503960366 4/14/2016         10460          6.74            6.74
##  4 1503960366 4/15/2016          9762          6.28            6.28
##  5 1503960366 4/16/2016         12669          8.16            8.16
##  6 1503960366 4/17/2016          9705          6.48            6.48
##  7 1503960366 4/18/2016         13019          8.59            8.59
##  8 1503960366 4/19/2016         15506          9.88            9.88
##  9 1503960366 4/20/2016         10544          6.68            6.68
## 10 1503960366 4/21/2016          9819          6.34            6.34
## # ℹ 933 more rows
## # ℹ 17 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## #   Hour <chr>, AM_or_PM <chr>, TotalSleepRecords <dbl>, …
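The Weekday column created above is plain text, so the Monday-to-Sunday order has to be restated in every plot. One alternative, sketched here on a few sample dates, is to store the weekday once as a factor with fixed levels. The sketch uses lubridate's wday() with week_start = 1, which returns 1 for Monday regardless of locale; day_levels is a name I made up for the illustration.

```r
library(lubridate)

# A few dates in the same format as ActivityDate
dates <- c("4/12/2016", "4/17/2016", "4/18/2016")

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# wday(..., week_start = 1) gives 1 = Monday, ..., 7 = Sunday
day_index <- wday(mdy(dates), week_start = 1)

# Factor whose levels already carry the plotting order
Weekday <- factor(day_levels[day_index], levels = day_levels)

Weekday
```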

Now, I believe I am ready to begin analyzing and visualizing the data. I will return to making more alterations when needed.

ANALYZE:

I want to see the difference of activity per Weekday, so I created a bar chart using the merged_data.

df <- data.frame(table(merged_data$Weekday))
ggplot(df) + geom_col(mapping=aes(x = factor(Var1, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = Freq, fill = Var1)) + labs(title = "Amount Activity Per Date")

summary(df)

##         Var1        Freq      
##  Friday   :1   Min.   :121.0  
##  Monday   :1   1st Qu.:123.0  
##  Saturday :1   Median :126.0  
##  Sunday   :1   Mean   :134.7  
##  Thursday :1   3rd Qu.:149.0  
##  Tuesday  :1   Max.   :152.0  
##  Wednesday:1

Although the graph does not show a large difference between the days, it is noticeable that users logged the most entries on Tuesday, Wednesday, and Thursday, while the least active day was Monday.
I also want to see how individuals differ in each aspect of the data. First, I looked at the discrepancy in steps, separated by Id and Weekday.
After a first attempt at the graph, I thought it would look better in descending order, so I created another dataset that holds the total steps per Id sorted in descending order. I use this dataset only to order the facets.
I used both facet_grid and facet_wrap to represent the data visually in two different ways.

hold_data <- merged_data %>% group_by(Id) %>% summarize(TotalStep = sum(TotalSteps))
hold_data <- hold_data %>% arrange(-TotalStep)

ggplot(merged_data ,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalSteps, fill = Weekday)) + geom_col()+
  facet_grid(~factor(Id, levels = unique(hold_data$Id))) + theme(axis.text.x = element_blank()) + labs(title = "Total Steps per Day for Each Id")

ggplot(merged_data ,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalSteps, fill = Weekday)) + geom_col()+
  facet_wrap(~factor(Id, levels = unique(hold_data$Id))) + theme(axis.text.x = element_blank()) + labs(title = "Total Steps per Day for Each Id")

We can see from these graphs that while most of the users are somewhat to very active, some show close to no activity at all. Now I want to see the overall view, as I did for the number of entries above.

ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalSteps, fill = Weekday)) + geom_col()+
  theme(axis.text.x = element_blank()) + labs(title = "Total Steps per Day")

This bar graph is somewhat similar to the first graph we generated. The weekdays with the most steps are the same as the weekdays with the most entries. However, the weekday with the fewest steps is Saturday, while the weekday with the fewest entries was Sunday. The simplest explanation would be that, despite having fewer entries, Sunday had entries (or at least some entries) with more steps than the entries on Saturday. This is only a guess, not a final conclusion.
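A guess like this can be checked by putting the number of entries, the total steps, and the mean steps per entry side by side for each weekday. The sketch below uses made-up step counts for two days to show the idea; the numbers are not from the dataset.

```r
library(dplyr)

# Hypothetical entries: Saturday has more entries, Sunday has more steps per entry
toy_steps <- data.frame(
  Weekday    = c("Saturday", "Saturday", "Saturday", "Sunday", "Sunday"),
  TotalSteps = c(3000, 4000, 2000, 6000, 5000)
)

by_day <- toy_steps %>%
  group_by(Weekday) %>%
  summarize(
    entries     = n(),              # how many rows fall on this weekday
    total_steps = sum(TotalSteps),  # what the stacked bar shows
    mean_steps  = mean(TotalSteps)  # steps per entry
  )

by_day
```

In this toy example Sunday has fewer entries but a larger total, because its steps per entry are higher. The same summarize() call on merged_data would show whether that is what happens in the real data.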
Now I want to see whether sleep time or sleep quality contributes to a more or less active day. First, I want to see whether there is a relationship between the amount of time spent idle in bed without sleeping and the number of steps walked for each entry.

ggplot(merged_data, mapping=aes(x = TimeDiff, y = TotalSteps)) + geom_point()+ labs(title = "Relationship between Time Awake in Bed and Total Steps") + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 530 rows containing missing values (`geom_point()`).

There is no obvious relationship between the time spent awake in bed and the TotalSteps walked, so we cannot say that these two statistics are related.
Now, I am curious if it causes a difference in the Total Minutes Active. Before I do this, I noticed that I did not have a TotalMin column in the dataset.

merged_data <- transform(merged_data, TotalMin=(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes))
merged_data <- merged_data %>% relocate(TotalMin, .before = TotalSteps)
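The TotalMin column above adds up all four minute columns, including SedentaryMinutes, so it measures tracked time rather than active time. As a sketch, a separate active-minutes column could be kept alongside it. ActiveMin is a hypothetical name, and the second row is a made-up fully sedentary day.

```r
# First row uses the first day shown in the glimpse() output; second row is made up
toy_minutes <- data.frame(
  VeryActiveMinutes    = c(25, 0),
  FairlyActiveMinutes  = c(13, 0),
  LightlyActiveMinutes = c(328, 0),
  SedentaryMinutes     = c(728, 1440)
)

# Tracked minutes: everything the device recorded, sedentary time included
toy_minutes$TotalMin <- with(toy_minutes,
  VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes + SedentaryMinutes)

# Active minutes: the same sum without the sedentary part
toy_minutes$ActiveMin <- with(toy_minutes,
  VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes)

toy_minutes[, c("TotalMin", "ActiveMin")]
```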

Now I want to do the same thing as I did for the graph above and look at the relationship between the time awake in bed and the total minutes.

ggplot(merged_data, mapping=aes(x = TimeDiff, y = TotalMin)) + geom_point()+ labs(title = "Relationship between Time Awake in Bed and Total Minutes") + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 530 rows containing missing values (`geom_point()`).

Although there appears to be a somewhat negative relationship (the more time spent awake in bed, the fewer minutes of activity), there are many outliers and the points are crowded in one place. However, if we look only at entries with less than 100 minutes spent awake in bed, the negative relationship is clear.
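Reading a relationship off a crowded scatter plot is subjective, so the restriction to less than 100 minutes can also be checked numerically. The sketch below uses made-up values, chosen so that the short-TimeDiff rows are perfectly linear, to show how a correlation on a subset can differ from the one on all rows.

```r
# Hypothetical data: five short-TimeDiff rows, two outliers, one row without sleep data
toy <- data.frame(
  TimeDiff = c(10, 20, 30, 40, 50, 200, 300, NA),
  TotalMin = c(1000, 980, 960, 940, 920, 1100, 900, 1440)
)

# Correlation over all rows that have both values
cor_all <- cor(toy$TimeDiff, toy$TotalMin, use = "complete.obs")

# Correlation restricted to less than 100 minutes awake in bed
# (subset() drops rows where the condition is NA)
short <- subset(toy, TimeDiff < 100)
cor_short <- cor(short$TimeDiff, short$TotalMin)

c(all = cor_all, under_100 = cor_short)
```

On merged_data, the same subset followed by cor.test() would show whether the relationship below 100 minutes is as clear as it looks.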
Now we compare the total minutes per day.

ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalMin, fill = Weekday)) + geom_col()+
  theme(axis.text.x = element_blank()) + labs(title = "Total Minutes per Day")

This bar graph matches the first bar graph, which compares the number of entries per day, except that the smallest day is Monday and not Sunday. It also somewhat matches the second bar graph, which compares the TotalSteps per day, except that Saturday was the day with the fourth most steps, while in this bar graph that position belongs to Friday. Perhaps longer walks explain the difference between the first two bar graphs.
This leads me to ask how all of this compares. That is, how do all these statistics (TotalMin, TotalSteps, TotalOccurrence, TotalDistance) compare when viewed from the perspective of the time spent awake in bed?

library(cowplot)

## Warning: package 'cowplot' was built under R version 4.3.3

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:lubridate':
## 
##     stamp

View(merged_data)
TotalMinGraph <- ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalMin, fill = Weekday)) + geom_col()+
  theme(axis.text.x = element_blank()) + labs(title = "Total Minutes per Day", x = NULL)

TotalStepsGraph <- ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalSteps, fill = Weekday)) + geom_col()+
  theme(axis.text.x = element_blank()) + labs(title = "Total Steps per Day", x = NULL)

TotalDistGraph <- ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalDistance, fill = Weekday)) + geom_col()+
  theme(axis.text.x = element_blank()) + labs(title = "Total Distance per Day", x = NULL)


df <- data.frame(table(merged_data$Weekday))
TotalActGraph <- ggplot(df) + geom_col(mapping=aes(x = factor(Var1, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = Freq, fill = Var1)) + labs(title = "Amount Activity Per Date", x = NULL) + theme(axis.text.x = element_blank())
 

plot_grid(TotalMinGraph, TotalStepsGraph, TotalDistGraph, TotalActGraph, labels = "AUTO")

Looking at the general order from biggest to smallest, the Total Minutes graph and the Amount Activity graph match each other, while Total Steps and Total Distance match each other. From these four graphs, we can confidently say that the users are most active, and the devices most used, on Tuesday, Wednesday, and Thursday. In addition, the two least active days are Sunday and Monday, while the other two days consistently remain somewhere in the middle.
This leads me to look at these metrics in relation to the Time Spent Awake in Bed.

library(cowplot)
TimeDiff_n_TotalSteps <- ggplot(merged_data, mapping=aes(x = TimeDiff, y = TotalSteps)) + geom_point()+ labs(title = "Relationship between Time Awake in Bed and Total Steps") + geom_smooth()

TimeDiff_n_TotalDist <- ggplot(merged_data, mapping=aes(x = TimeDiff, y = TotalDistance)) + geom_point()+labs(title = "Relationship between Time Awake in Bed and Total Distance") + geom_smooth()

TimeDiff_n_TotalMin <- ggplot(merged_data, mapping=aes(x= TimeDiff, y = TotalMin)) + geom_point()+labs(title= "Relationship between Time Awake in Bed and Total Minutes") + geom_smooth()

cor.test(merged_data$TimeDiff, merged_data$TotalMin)

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$TimeDiff and merged_data$TotalMin
## t = -4.4746, df = 411, p-value = 9.925e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.305665 -0.121561
## sample estimates:
##        cor 
## -0.2155274

cor.test(merged_data$TimeDiff, merged_data$TotalDistance)

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$TimeDiff and merged_data$TotalDistance
## t = 0.12105, df = 411, p-value = 0.9037
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09057597  0.10240631
## sample estimates:
##         cor 
## 0.005970762

cor.test(merged_data$TimeDiff, merged_data$TotalSteps)

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$TimeDiff and merged_data$TotalSteps
## t = 0.54976, df = 411, p-value = 0.5828
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06956893  0.12327966
## sample estimates:
##        cor 
## 0.02710758

plot_grid(TimeDiff_n_TotalSteps, TimeDiff_n_TotalDist, TimeDiff_n_TotalMin, labels = "AUTO")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 530 rows containing missing values (`geom_point()`).

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).

Looking at the graphs, we cannot deduce a clear relationship between the statistics. What was interesting to me was that while some entries had a total of 0 minutes active, there were no entries with 0 steps or 0 distance. Perhaps this comes from a difference in how the totals were calculated, since Total Minutes was the only one that I had to create and calculate manually (I included the sedentary minutes in my Total Minutes, and perhaps the other totals do not include that kind of activity).
Looking at these graphs, I think it might be a stretch to say there is a connection with the amount of time spent awake in bed, as there is no clear relationship. One thing we can say is that there is a weak negative relationship with Total Minutes: the correlation between TimeDiff and TotalMin was -0.22, the largest in magnitude of the three and the only one that is statistically significant (p < 0.001).
Maybe the amount slept has a clearer relationship? This time I will do the same as above, but use TotalMinutesAsleep instead.

# The plot object names are reused from the previous section; x is now TotalMinutesAsleep
TimeDiff_n_TotalSteps <- ggplot(merged_data, mapping=aes(x = TotalMinutesAsleep, y = TotalSteps)) + geom_point()+ labs(title = "Relationship between Total Minutes Asleep and Total Steps") + geom_smooth()

TimeDiff_n_TotalDist <- ggplot(merged_data, mapping=aes(x = TotalMinutesAsleep, y = TotalDistance)) + geom_point()+labs(title = "Relationship between Total Minutes Asleep and Total Distance") + geom_smooth()

TimeDiff_n_TotalMin <- ggplot(merged_data, mapping=aes(x= TotalMinutesAsleep, y = TotalMin)) + geom_point()+labs(title= "Relationship between Total Minutes Asleep and Total Minutes") + geom_smooth()

cor.test(merged_data$TotalMinutesAsleep, merged_data$TotalMin)

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$TotalMinutesAsleep and merged_data$TotalMin
## t = -16.456, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6850669 -0.5683003
## sample estimates:
##        cor 
## -0.6302341

cor.test(merged_data$TotalMinutesAsleep, merged_data$TotalDistance)

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$TotalMinutesAsleep and merged_data$TotalDistance
## t = -3.5428, df = 411, p-value = 0.0004414
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.26424789 -0.07692598
## sample estimates:
##        cor 
## -0.1721427

cor.test(merged_data$TotalMinutesAsleep, merged_data$TotalSteps)

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$TotalMinutesAsleep and merged_data$TotalSteps
## t = -3.8563, df = 411, p-value = 0.0001336
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27834209 -0.09203143
## sample estimates:
##        cor 
## -0.1868665

plot_grid(TimeDiff_n_TotalSteps, TimeDiff_n_TotalDist, TimeDiff_n_TotalMin, labels = "AUTO")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 530 rows containing missing values (`geom_point()`).

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).

Again, looking at this, there is at most a weak negative relationship between TotalMinutesAsleep and TotalSteps or TotalDistance (correlations of about -0.19 and -0.17). However, there is a clear negative relationship between TotalMinutesAsleep and Total Minutes active (outside of a few outliers). This is a curious finding, as it shows that the more the users slept, the fewer minutes they were active. Part of this may follow from how TotalMin was built: it includes sedentary minutes, so time recorded as sleep may leave fewer minutes to be counted.
The correlation between TotalMinutesAsleep and TotalMin was -0.63, a moderately strong negative correlation.
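The three `cor.test()` calls above follow the same pattern, so the correlations can also be collected in one pass. The following is a minimal sketch on a small made-up data frame: the `toy` values and the helper name `cor_against` are illustrative assumptions, not part of the original analysis. With the real data, `merged_data` would take the place of `toy`.

```r
# Hypothetical helper: Pearson correlation of one sleep column against
# several activity columns, dropping rows where either value is missing.
cor_against <- function(data, x, ys) {
  sapply(ys, function(y) {
    cor(data[[x]], data[[y]], use = "complete.obs")
  })
}

# Made-up values for illustration only; not taken from the Fitbit data.
toy <- data.frame(
  TotalMinutesAsleep = c(300, 360, 420, 480, NA, 540),
  TotalSteps         = c(12000, 9000, 10000, 7000, 8000, 6500),
  TotalDistance      = c(8.5, 6.1, 7.0, 4.9, 5.5, 4.4),
  TotalMin           = c(1150, 1075, 1030, 950, 1440, 905)
)

cors <- cor_against(toy, "TotalMinutesAsleep",
                    c("TotalSteps", "TotalDistance", "TotalMin"))
print(round(cors, 3))
```

The result is a named vector with one correlation per activity column. It gives only the point estimates, so `cor.test()` is still needed for the p-values and confidence intervals.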
Now I will do the same thing, this time with TotalTimeInBed.

# The object names are reused from the previous step; the x-axis is now TotalTimeInBed.
TimeDiff_n_TotalSteps <- ggplot(merged_data, mapping = aes(x = TotalTimeInBed, y = TotalSteps)) + geom_point() + labs(title = "Relationship between Time in Bed and Total Steps") + geom_smooth()

TimeDiff_n_TotalDist <- ggplot(merged_data, mapping = aes(x = TotalTimeInBed, y = TotalDistance)) + geom_point() + labs(title = "Relationship between Time in Bed and Total Distance") + geom_smooth()

TimeDiff_n_TotalMin <- ggplot(merged_data, mapping = aes(x = TotalTimeInBed, y = TotalMin)) + geom_point() + labs(title = "Relationship between Time in Bed and Total Minutes") + geom_smooth()

cor.test(merged_data$TotalTimeInBed, merged_data$TotalMin)

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$TotalTimeInBed and merged_data$TotalMin
## t = -18.09, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7162610 -0.6083721
## sample estimates:
##        cor 
## -0.6657822

cor.test(merged_data$TotalTimeInBed, merged_data$TotalDistance)

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$TotalTimeInBed and merged_data$TotalDistance
## t = -3.2459, df = 411, p-value = 0.001267
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.25076397 -0.06255465
## sample estimates:
##        cor 
## -0.1580949

cor.test(merged_data$TotalTimeInBed, merged_data$TotalSteps)

## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$TotalTimeInBed and merged_data$TotalSteps
## t = -3.3717, df = 411, p-value = 0.0008178
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.25649374 -0.06865199
## sample estimates:
##        cor 
## -0.1640597

plot_grid(TimeDiff_n_TotalSteps, TimeDiff_n_TotalDist, TimeDiff_n_TotalMin, labels = "AUTO")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 530 rows containing missing values (`geom_point()`).

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).

The finding is similar to the one for Total Minutes Asleep. The correlation between TotalTimeInBed and TotalMin was -0.6658, a moderately strong negative relationship between those two variables. The correlation between TotalTimeInBed and TotalDistance was -0.1581, and between TotalTimeInBed and TotalSteps it was -0.1641.
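Because TotalMin includes sedentary minutes, one way to probe the strong negative correlation would be to recompute it with sedentary time left out. The following is only a sketch on made-up numbers. The column names `VeryActiveMinutes`, `FairlyActiveMinutes`, `LightlyActiveMinutes`, and `SedentaryMinutes` are assumed to match the daily activity table and should be checked against the actual data, and `ActiveMin` is a hypothetical name introduced here.

```r
# Made-up values for illustration only; not taken from the Fitbit data.
toy <- data.frame(
  TotalTimeInBed       = c(320, 380, 450, 500, 560),
  VeryActiveMinutes    = c(20, 35, 10, 25, 15),
  FairlyActiveMinutes  = c(15, 10, 20, 12, 18),
  LightlyActiveMinutes = c(200, 180, 230, 190, 210),
  SedentaryMinutes     = c(880, 830, 720, 700, 630)
)

# Total minutes with sedentary time included (the definition used in the text).
toy$TotalMin <- with(toy, VeryActiveMinutes + FairlyActiveMinutes +
                            LightlyActiveMinutes + SedentaryMinutes)

# Hypothetical alternative: the same total with sedentary time excluded.
toy$ActiveMin <- toy$TotalMin - toy$SedentaryMinutes

cor_total  <- cor(toy$TotalTimeInBed, toy$TotalMin)
cor_active <- cor(toy$TotalTimeInBed, toy$ActiveMin)

cat("With sedentary minutes:   ", round(cor_total, 2), "\n")
cat("Without sedentary minutes:", round(cor_active, 2), "\n")
```

The toy numbers were chosen only so that the two definitions can give different answers; they say nothing about what the real data would show. Running the same comparison on `merged_data` would indicate how much of the -0.67 correlation depends on counting sedentary time.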

After cleaning, managing, and preparing the data, here are the results that I found.
First, I found a big discrepancy in how much the 30 users used their Bellabeat devices.
The Ids were ordered by descending Total Steps. The difference in activity is very evident from the visualization. The user with the most steps (8877689391) had a total of 497,241 steps, while the user with the fewest steps (4057192912) had a total of 15,352: a difference of 481,889 steps between them. To put this into perspective, according to Mayo Clinic, an average American walks between 3,000 and 4,000 steps a day. That means an average American walks at least 90,000 steps per month (using the lower figure and a 30-day month).
So the gap between the most and least active walkers is roughly five times what an average American walks in a month (481,889 / 90,000 ≈ 5.4; the data was collected from 4/12/16 to 5/9/16, which is close to one month).
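The step comparison above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the text: the two user totals, the Mayo Clinic range of 3,000 to 4,000 steps per day, and the simplifying assumption of a 30-day month.

```r
most_steps  <- 497241   # user 8877689391
least_steps <- 15352    # user 4057192912
gap <- most_steps - least_steps

# Monthly totals for an average American, using a 30-day month.
monthly_low  <- 3000 * 30
monthly_high <- 4000 * 30

# How many "average months" of walking fit into the gap.
ratio_low_rate  <- gap / monthly_low
ratio_high_rate <- gap / monthly_high

cat("Gap in steps:", gap, "\n")
cat("Gap / 90,000 steps: ", round(ratio_low_rate, 2), "\n")
cat("Gap / 120,000 steps:", round(ratio_high_rate, 2), "\n")
```

At the lower rate the gap is a little over five average months of walking, and at the higher rate it is about four.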
Another thing that I looked into was the difference in activity per weekday.
Although the graphs differ in their details, it is clear that the most active day was Tuesday.
We can also see that the users tended to be less active on weekends and Mondays, and more active on Tuesdays, Wednesdays, and Thursdays.
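A weekday comparison like this one comes down to grouping daily records by day of the week and averaging. The following is a minimal sketch on made-up records: the `toy` data frame and the `ActivityDate` and `WeekdayNum` column names are assumptions for illustration, not the code used to produce the graphs.

```r
# Made-up daily records for illustration only.
toy <- data.frame(
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-16",
                           "2016-04-17", "2016-04-18", "2016-04-19")),
  TotalSteps   = c(11000, 9500, 6000, 5000, 7000, 12000)
)

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday), which avoids
# locale-dependent day names.
toy$WeekdayNum <- as.integer(format(toy$ActivityDate, "%u"))

# Mean steps per weekday, sorted from most to least active.
mean_steps <- aggregate(TotalSteps ~ WeekdayNum, data = toy, FUN = mean)
mean_steps <- mean_steps[order(-mean_steps$TotalSteps), ]
print(mean_steps)
```

Averaging per weekday rather than summing keeps the comparison fair when some weekdays appear more often than others in the collection window.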
I also found that the amount of time slept, the amount of time spent in bed, and the amount of time spent awake in bed had no strong correlation with how active the users were (total steps and total distance).

We can see that there is a slight U-shaped curve for “TimeSlept vs Total Steps” and “TimeSlept vs Total Distance”. However, because of all the outliers present, we cannot draw a firm conclusion from it.
However, there was a strong correlation when it came to Total Minutes Active.
We can clearly see a negative relationship between Total Minutes Active, which includes all types of activity, and Time in Bed (-0.67) and Time Slept (-0.63), with a weaker one for Time Awake in Bed (-0.21).
This finding was a bit surprising to me because I believed that the more a person slept, the more active they would be. However, the graphs show that the more time the users spent in bed, awake in bed, and sleeping, the less active they were.

Call To Action: Optimizing the Usage of the Bellabeat Devices

We found a large discrepancy between users in the activity recorded by the Bellabeat devices. We also found a consistent difference in usage and activity across the days of the week. Therefore, in order to encourage more use of the devices, I suggest:

Bellabeat should push for an app, downloadable on mobile devices, that connects to one or all of the Bellabeat fitness devices that the customer has. Then, using both the app and the devices, it should send alerts and notifications to users who seem to be less active than most other users.
Bellabeat can also choose the days on which to send the alerts. We saw that the users were most active on Tuesday, Wednesday, and Thursday, and less active on Friday, Saturday, Sunday, and Monday (least active on Sunday and/or Monday). Therefore, Bellabeat should focus on sending alerts on Sunday and Monday (regardless of recorded activity) and also send alerts on Friday and Saturday.
Bellabeat can also build a feature that asks users how active they want to be. Depending on the answer, it can send them a daily report of how much they slept and how active they were.
- For example, if a user stated they want to be active for at least 1,000 minutes, their Bellabeat device could recommend spending no more than 50 minutes idling in bed.
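The alert suggestions above can be sketched as a simple rule. This is a hypothetical illustration, not an existing Bellabeat feature: the function name `should_alert`, the weekday numbering (1 = Monday through 7 = Sunday), and the use of the median as the cutoff for "less active than most other users" are all assumptions.

```r
# Hypothetical rule: always alert on the low-activity days named above
# (Fri, Sat, Sun, Mon); on other days, alert only when the user's steps
# fall below the median of all users (an assumed threshold).
should_alert <- function(weekday_num, user_steps, all_users_steps) {
  low_days <- c(5, 6, 7, 1)  # Fri, Sat, Sun, Mon
  weekday_num %in% low_days || user_steps < median(all_users_steps)
}

# Made-up step counts for three other users.
others <- c(5000, 7000, 8000)

alert_tue_active   <- should_alert(2, 9000, others)  # Tuesday, above median
alert_sun_active   <- should_alert(7, 9000, others)  # Sunday, above median
alert_wed_inactive <- should_alert(3, 4000, others)  # Wednesday, below median

print(c(alert_tue_active, alert_sun_active, alert_wed_inactive))
```

Keeping the day list and the threshold as separate conditions makes it easy to adjust either one, for example alerting only on Sunday and Monday regardless of activity.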

Bellabeat Case Study

Yosup Kim

2024-03-25
