In this case study, a company, Bellabeat, asks us to track and identify trends in how customers use their smart devices. Bellabeat is a fitness device company with a total of 3 devices: Leaf, Time, and Spring. Bellabeat also offers a subscription-based membership that gives users fully guided support for their nutrition, activity, sleep, health, and beauty based on their lifestyle and goals.
Goals:
The shareholder of Bellabeat has asked us to answer 3 major questions:
What are some visible trends in their smart device usage?
How could these trends apply to Bellabeat Customers?
And how could these trends help influence Bellabeat’s future marketing strategy?
Important People:
The data comes from Kaggle and consists of 18 CSV files. Some of the CSV files contain the same information twice, saved once in long format and once in wide format. The dataset is licensed as CC0: Public Domain.
About the Data:
Although the Kaggle page did not contain a description of the dataset, a data dictionary for it was found in an outside resource (Fitabase Fitbit Data Dictionary as of 2:14:24).
The dataset contains personal fitness information that is tracked by the users' fitness devices (Fitbit devices, going by the data dictionary, rather than Bellabeat's own) or entered manually by the verified users. The total number of verified users in this dataset is said to be 30, and their age, name, and sex are not disclosed. All thirty users consented to the submission of their personal tracked data, which includes heart rate, active minutes, activity intensity, sleep measurements, steps, weight, and MET. Each user is identified by a unique numeric ID.
Limitations:
Because of the small sample size of 30 users and the omission of some of their personal information, we cannot claim that the analysis below is fully representative. To do so, we would need a bigger pool of data to analyze.
First, I loaded the necessary libraries and then read the CSV files that I wanted into R.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidyr)
library(readr)
library(lubridate)
dailyActivity <- read_csv("C:/Users/yosup/Downloads/archive (3)/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weightLog <- read_csv("C:/Users/yosup/Downloads/archive (3)/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleepDay <- read_csv("C:/Users/yosup/Downloads/archive (3)/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate_seconds <- read_csv("C:/Users/yosup/Downloads/archive (3)/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
n_distinct(dailyActivity$Id)
## [1] 33
We can see that there are a total of 33 different users in the dataset, whereas the data dictionary states that there are only 30. This could be due to an error by the person gathering the data, or some users might have more than one ID.
We will start with the weightLog dataset. We want to see how many users are in this dataset.
n_distinct(weightLog$Id)
## [1] 8
We see that there are a total of 8 distinct users, so immediately we are aware that not all users used their fitness device to track their weight.
We also want to see how many of the weight entries were entered manually.
length(which(weightLog$IsManualReport==TRUE))
## [1] 41
length(weightLog$IsManualReport)
## [1] 67
Out of the 67 entries in the weightLog dataset, there were a total of 41 entries that were manually inputted by the user.
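The share of manual entries follows directly from these two counts. Below is a minimal sketch using a made-up logical vector, `is_manual` (a hypothetical stand-in for `weightLog$IsManualReport`), built to match the counts reported above.

```r
# Hypothetical stand-in for weightLog$IsManualReport: 41 TRUE values out of 67
is_manual <- c(rep(TRUE, 41), rep(FALSE, 26))

# sum() counts the TRUE values; mean() gives the proportion of TRUE values
manual_count <- sum(is_manual)
manual_share <- mean(is_manual)

manual_count
round(manual_share * 100, 1)   # percentage of entries that were manual
```

On the real data, `sum(weightLog$IsManualReport)` and `mean(weightLog$IsManualReport)` would play the same roles, provided the column has no NA values.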
I noticed that there were a lot of NA values in the Fat column of the weightLog dataset, so I checked how many there really were.
sum(is.na(weightLog$Fat))
## [1] 65
65 out of 67 entries were NA in the Fat column, so it seems the users rarely used the smart devices to track their "Fat".
To get an even better understanding of the data, I want to see the time span of the sleep log. I took the earliest and the latest recorded values and saved them into "date_range_sleepDay". The result shows 4/12/2016 at 12 AM as the earliest entry and 5/9/2016 at 12 AM as the last. Note that SleepDay is stored as text (chr), so range() compares the values as strings rather than as dates: "5/9/..." sorts after any "5/1x/..." value, so the true last date may be later than 5/9/2016 (the folder name suggests the export runs through 5/12/2016). Either way, the data covers roughly one month.
date_range_sleepDay <- range(sleepDay$SleepDay)
date_range_sleepDay
## [1] "4/12/2016 12:00:00 AM" "5/9/2016 12:00:00 AM"
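Because the column is text, it may help to see how string order and date order can disagree. The sketch below uses three toy timestamps in the same format (chosen for illustration, not copied from the dataset) and base R's `as.Date()`, which reads the leading month/day/year and ignores the trailing time text.

```r
# Three timestamps in the same text format as sleepDay$SleepDay (toy values)
stamps <- c("4/12/2016 12:00:00 AM", "5/9/2016 12:00:00 AM", "5/12/2016 12:00:00 AM")

# On character data, range() compares strings, so "5/9/..." sorts after "5/12/..."
text_range <- range(stamps)

# Parse to Date first, then take the range in calendar order
dates <- as.Date(stamps, format = "%m/%d/%Y")
date_range <- range(dates)

text_range
date_range
```

With lubridate already loaded, `mdy_hms()` would be the equivalent parser for the full timestamp.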
sdEntries <- sleepDay %>% count(Id)
sd(sdEntries$n)
## [1] 11.49661
We get a large standard deviation of 11.49661 in the number of entries per user, which tells us that the entries are not spread evenly across users; the sleep records are dominated by some of them.
Now we look at the TotalMinutesAsleep and the TotalTimeInBed column of the dataset. To begin, I want to calculate the mean of both.
mean(sleepDay$TotalMinutesAsleep)
## [1] 419.4673
hist(sleepDay$TotalMinutesAsleep)
mean(sleepDay$TotalTimeInBed)
## [1] 458.6392
hist(sleepDay$TotalTimeInBed)
We get that on average, the users slept around 419 minutes, while also spending an average of 459 minutes in bed (including the sleeping time). Therefore, on average, around 40 minutes were spent awake in bed. The histograms are there to just help me visualize the data.
To make the sleepDay dataset a bit more to my liking, I created a new column “TimeDiff” which shows the difference between the time spent in bed and the time spent actually sleeping.
sleepDay <- transform(sleepDay, TimeDiff = abs(sleepDay$TotalMinutesAsleep - sleepDay$TotalTimeInBed))
I will use this new column later on in my analysis.
Now taking a look at the heartrate dataset, we begin by finding the max/min of the Time and Value columns. I also want to see the mean and the spread of the Value column.
range(heartrate_seconds$Time)
## [1] "4/12/2016 1:00:00 AM" "5/9/2016 9:59:59 PM"
hist(heartrate_seconds$Value)
summary(heartrate_seconds$Value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.00 63.00 73.00 77.33 88.00 203.00
sd(heartrate_seconds$Value)
## [1] 19.4045
Again, as with the sleepDay dataset, the reported span is from 4/12/16 to 5/9/16, the only difference being that the times run from 1:00:00 AM to 9:59:59 PM instead of 12 AM. The same caveat applies: Time is stored as text, so these are the first and last values in string order, not necessarily the earliest and latest timestamps. We also see that the lowest heart rate recorded was 36 and the highest was 203.
A normal adult resting heart rate is 60-100 bpm, but 40-60 bpm can be normal when one is sleeping, so for the reading of 36 the user might have been sleeping, or perhaps the device malfunctioned.
The target heart rate for a person in their 20's who is working out is 100-170 bpm, with an average maximum of 200. So for the reading of 203 we might assume that the user was working out or having an emergency (or that this is another error with the device).
Now I want to create a simpler version of the heartrate_seconds because of its large size.
# One row per user: extremes of Time and Value, plus the number of readings.
# Time is still character here, so maxTime/minTime are the last/first values in string order.
max_min_heart8 <- heartrate_seconds %>%
  group_by(Id) %>%
  summarize(maxTime = max(Time), minTime = min(Time),
            maxValue = max(Value), minValue = min(Value),
            Count = n()) %>%
  as.data.frame()
max_min_heart8
## Id maxTime minTime maxValue minValue
## 1 2022484408 5/9/2016 9:59:55 AM 4/12/2016 1:00:00 PM 203 38
## 2 2026352035 5/9/2016 7:49:45 PM 4/17/2016 5:30:20 AM 125 63
## 3 2347167796 4/29/2016 6:56:50 AM 4/12/2016 1:00:10 PM 195 49
## 4 4020332650 5/9/2016 9:59:59 PM 4/12/2016 1:00:00 AM 191 46
## 5 4388161847 5/9/2016 9:59:55 PM 4/13/2016 1:00:00 AM 180 39
## 6 4558609924 5/9/2016 9:59:55 AM 4/12/2016 1:00:00 PM 199 44
## 7 5553957443 5/9/2016 9:59:55 PM 4/12/2016 1:00:05 AM 165 47
## 8 5577150313 5/9/2016 9:59:50 PM 4/12/2016 1:00:00 AM 174 36
## 9 6117666160 5/9/2016 9:59:45 AM 4/15/2016 1:00:00 PM 189 52
## 10 6775888955 5/7/2016 9:59:55 AM 4/13/2016 10:00:00 PM 177 55
## 11 6962181067 5/9/2016 9:59:55 AM 4/12/2016 1:00:00 AM 184 47
## 12 7007744171 5/6/2016 9:59:50 AM 4/12/2016 1:00:00 PM 166 54
## 13 8792009665 5/4/2016 9:59:45 AM 4/12/2016 1:00:00 PM 158 43
## 14 8877689391 5/9/2016 9:59:56 PM 4/12/2016 1:00:05 PM 180 46
## Count
## 1 154104
## 2 2490
## 3 152683
## 4 285461
## 5 249748
## 6 192168
## 7 255174
## 8 248560
## 9 158899
## 10 32771
## 11 266326
## 12 133592
## 13 122841
## 14 228841
max_min_heart8 contains a summary for each unique user ID in the heartrate_seconds dataset. It shows the max/min of both the heart rate value and the entry time, along with the number of occurrences of each ID in the heartrate_seconds dataset. Because Time is stored as text, maxTime and minTime are the last and first values in string order rather than the true latest and earliest timestamps. I will use this dataset later for analysis.
The dailyActivity dataset is also big, so just as I did for the heartrate_seconds dataset, I will create another dataset that summarizes it. This time, however, I will only find the max of each column for each ID for now.
maxmin_dailyActivity <- dailyActivity %>% group_by(Id) %>% summarize(across(everything(), max))
glimpse(dailyActivity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
# Keep only the days with a logged activity distance
active_days <- filter(dailyActivity, LoggedActivitiesDistance > 0)
# Note: length() on a data frame returns its number of columns, not its number of rows
length(active_days)
## [1] 15
# Only 4 unique users had LoggedActivitiesDistance > 0
n_distinct(active_days$Id)
## [1] 4
The 15 returned by length() is the number of columns in active_days (the same 15 columns as dailyActivity), not the number of manually logged entries; nrow(active_days) would give that count. What we can say is that only 4 distinct users logged an activity distance.
This might mean a few things: either only 4 of the 33 users chose to use this feature, or the device had an error that caused the users to enter the distance themselves. Or perhaps the users were not wearing the device during their activity and therefore had to enter the distance themselves.
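The difference between counting columns and counting rows is easy to check on a small example. The sketch below uses a toy data frame, `activity`, with made-up values (not rows from dailyActivity) and base R subsetting in place of `filter()`.

```r
# Toy data frame (hypothetical values): 6 rows, 3 columns
activity <- data.frame(
  Id = c(1, 1, 2, 3, 3, 3),
  LoggedActivitiesDistance = c(0, 2.1, 0, 1.6, 3.2, 0.8),
  TotalSteps = c(4000, 9000, 2500, 7000, 11000, 6500)
)

# Keep only rows with a logged distance
logged <- activity[activity$LoggedActivitiesDistance > 0, ]

n_columns <- length(logged)              # number of columns in the data frame
n_entries <- nrow(logged)                # number of rows, i.e. logged entries
n_users   <- length(unique(logged$Id))   # number of distinct users among them

c(columns = n_columns, entries = n_entries, users = n_users)
```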
What I want to do now is create a new dataset that contains all (or most) of the information I need, so that instead of using multiple datasets, I can use just one or two. First, I created sleepTemp, which temporarily holds the information I need from sleepDay while I merge it with dailyActivity.
sleepTemp <- sleepDay %>% separate(SleepDay, into = c("NewDate", "Hour", "AM_or_PM"), sep = " ")
View(sleepTemp)
merged_data <- left_join(dailyActivity, sleepTemp, by = c('Id' = 'Id','ActivityDate' = 'NewDate'), relationship = "many-to-many")
merged_data <- merged_data %>% relocate(ActivityDate, .before = TotalSteps)
n_distinct(merged_data$Id)
## [1] 33
summary(merged_data)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Length:943 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3795 1st Qu.: 2.620
## Median :4.445e+09 Mode :character Median : 7439 Median : 5.260
## Mean :4.858e+09 Mean : 7652 Mean : 5.503
## 3rd Qu.:6.962e+09 3rd Qu.:10734 3rd Qu.: 7.720
## Max. :8.878e+09 Max. :36019 Max. :28.030
##
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.000 1st Qu.: 0.000
## Median : 5.260 Median :0.000 Median : 0.220
## Mean : 5.489 Mean :0.110 Mean : 1.504
## 3rd Qu.: 7.715 3rd Qu.:0.000 3rd Qu.: 2.065
## Max. :28.030 Max. :4.942 Max. :21.920
##
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.950 1st Qu.:0.000000
## Median :0.2400 Median : 3.380 Median :0.000000
## Mean :0.5709 Mean : 3.349 Mean :0.001601
## 3rd Qu.:0.8050 3rd Qu.: 4.790 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
##
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127 1st Qu.: 729.0
## Median : 4.00 Median : 7.00 Median :199 Median :1057.0
## Mean : 21.24 Mean : 13.63 Mean :193 Mean : 990.4
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264 3rd Qu.:1229.0
## Max. :210.00 Max. :143.00 Max. :518 Max. :1440.0
##
## Calories Hour AM_or_PM TotalSleepRecords
## Min. : 0 Length:943 Length:943 Min. :1.000
## 1st Qu.:1830 Class :character Class :character 1st Qu.:1.000
## Median :2140 Mode :character Mode :character Median :1.000
## Mean :2308 Mean :1.119
## 3rd Qu.:2796 3rd Qu.:1.000
## Max. :4900 Max. :3.000
## NA's :530
## TotalMinutesAsleep TotalTimeInBed TimeDiff
## Min. : 58.0 Min. : 61.0 Min. : 0.00
## 1st Qu.:361.0 1st Qu.:403.0 1st Qu.: 17.00
## Median :433.0 Median :463.0 Median : 25.00
## Mean :419.5 Mean :458.6 Mean : 39.17
## 3rd Qu.:490.0 3rd Qu.:526.0 3rd Qu.: 40.00
## Max. :796.0 Max. :961.0 Max. :371.00
## NA's :530 NA's :530 NA's :530
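The merged data has 943 rows, while dailyActivity has 940. One possible explanation, not checked in the analysis above, is that sleepDay contains repeated rows, each of which duplicates the matching activity row in the join. The sketch below illustrates the effect with toy tables (hypothetical values) and base R's `merge()`; in dplyr, `distinct()` followed by `left_join()` would be the equivalent.

```r
# Toy tables (hypothetical values). The sleep table repeats one row on purpose.
activity <- data.frame(
  Id = c(1, 1, 2),
  Date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 5000)
)
sleep <- data.frame(
  Id = c(1, 1, 1),
  Date = c("4/12/2016", "4/13/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 384, 384)
)

# Left join as-is: the repeated sleep row duplicates the matching activity row
joined_raw <- merge(activity, sleep, by = c("Id", "Date"), all.x = TRUE)

# Dropping exact duplicate rows first keeps one output row per activity row
sleep_unique <- unique(sleep)
joined_clean <- merge(activity, sleep_unique, by = c("Id", "Date"), all.x = TRUE)

n_duplicates <- sum(duplicated(sleep))
c(raw = nrow(joined_raw), clean = nrow(joined_clean), duplicates = n_duplicates)
```

If duplicates turn out to be the cause, removing them before the join would also make the `relationship = "many-to-many"` argument unnecessary.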
merged_data <- merged_data %>% mutate(Weekday = weekdays(mdy(ActivityDate)))
merged_data
## # A tibble: 943 × 22
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## 7 1503960366 4/18/2016 13019 8.59 8.59
## 8 1503960366 4/19/2016 15506 9.88 9.88
## 9 1503960366 4/20/2016 10544 6.68 6.68
## 10 1503960366 4/21/2016 9819 6.34 6.34
## # ℹ 933 more rows
## # ℹ 17 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## # Hour <chr>, AM_or_PM <chr>, TotalSleepRecords <dbl>, …
df <- data.frame(table(merged_data$Weekday))
ggplot(df) + geom_col(mapping=aes(x = factor(Var1, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = Freq, fill = Var1)) + labs(title = "Amount Activity Per Date")
summary(df)
## Var1 Freq
## Friday :1 Min. :121.0
## Monday :1 1st Qu.:123.0
## Saturday :1 Median :126.0
## Sunday :1 Mean :134.7
## Thursday :1 3rd Qu.:149.0
## Tuesday :1 Max. :152.0
## Wednesday:1
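Although the graph does not show a large difference between the days, it is noticeable that the users have the most entries on Tuesday, Wednesday, and Thursday, while the day with the fewest entries was Monday. Note that this graph counts entries (rows) per weekday, so part of the difference may be a calendar effect: if the data runs from 4/12/2016 (a Tuesday) to 5/12/2016 (a Thursday), as the folder name suggests, those three weekdays occur once more than the others.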
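How often each weekday occurs in the collection window affects these counts. The sketch below assumes the window is 4/12/2016 to 5/12/2016, taken from the folder name "4.12.16-5.12.16" (an assumption, not something verified from the data), and counts the occurrences of each weekday in it.

```r
# Assumed date span, taken from the folder name (not verified from the data)
days <- seq(as.Date("2016-04-12"), as.Date("2016-05-12"), by = "day")

# "%u" is the ISO weekday number: 1 = Monday ... 7 = Sunday (locale independent)
weekday_number <- as.integer(format(days, "%u"))

occurrences <- table(factor(weekday_number, levels = 1:7,
                            labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
occurrences
```

Under that assumption, Tuesday, Wednesday, and Thursday each occur five times and the other days four times, so dividing each weekday's entry count by its number of occurrences would give a fairer comparison.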
I also want to see how individuals differ in each aspect of the data. First, I looked at the differences in steps, separated by Id and Weekday.
After a first attempt at graphing, I thought it would look better in descending order, so I created another dataset that holds the total steps per Id in descending order. I use this dataset only to order the facets.
I used both facet_grid and facet_wrap to represent the data visually in 2 different ways.
hold_data <- merged_data %>% group_by(Id) %>% summarize(TotalStep = sum(TotalSteps))
hold_data <- hold_data %>% arrange(-TotalStep)
ggplot(merged_data ,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalSteps, fill = Weekday)) + geom_col()+
facet_grid(~factor(Id, levels = unique(hold_data$Id))) + theme(axis.text.x = element_blank()) + labs(title = "Total Steps per Day for Each Id")
ggplot(merged_data ,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalSteps, fill = Weekday)) + geom_col()+
facet_wrap(~factor(Id, levels = unique(hold_data$Id))) + theme(axis.text.x = element_blank()) + labs(title = "Total Steps per Day for Each Id")
ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalSteps, fill = Weekday)) + geom_col()+
theme(axis.text.x = element_blank()) + labs(title = "Total Steps per Day")
This bar graph is somewhat similar to the first graph we generated. The weekdays with the most steps are the same as the weekdays with the most entries. However, the weekday with the fewest steps is Saturday, while the weekday with the fewest entries was Sunday. The simplest explanation would be that, despite having fewer entries, Sunday's entries (or some of them) had more steps than Saturday's. However, this is only a guess, not a conclusion. Because geom_col() stacks every row that falls on the same weekday, each bar is a sum, so its height depends on the number of entries as well as on the steps per entry.
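One way to separate "more entries" from "more steps per entry" is to compare the sum and the mean per weekday. The sketch below uses a toy data frame, `steps`, with made-up values and base R's `aggregate()`; in dplyr, `group_by(Weekday)` with `summarize()` would do the same.

```r
# Toy data (hypothetical values): Tuesday has more entries but fewer steps per entry
steps <- data.frame(
  Weekday = c("Tuesday", "Tuesday", "Tuesday", "Saturday", "Saturday"),
  TotalSteps = c(6000, 7000, 8000, 9000, 10000)
)

# Sum per weekday: what the stacked geom_col() bars add up to
total_by_day <- aggregate(TotalSteps ~ Weekday, data = steps, FUN = sum)

# Mean per weekday: steps per entry, independent of how many entries a day has
mean_by_day <- aggregate(TotalSteps ~ Weekday, data = steps, FUN = mean)

total_by_day
mean_by_day
```

In this toy case Tuesday has the larger total but Saturday has the larger mean, which is the kind of discrepancy the bar graphs of totals cannot show.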
Now, I want to see if sleep time or quality of sleep contributed to a more or less active day. First, I want to see if there is a relationship between the amount of time spent idling in bed without sleeping and the number of steps walked for each entry.
ggplot(merged_data, mapping=aes(x = TimeDiff, y = TotalSteps)) + geom_point()+ labs(title = "Relationship between Time Awake in Bed and Total Steps") + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 530 rows containing missing values (`geom_point()`).
We can see that there is no obvious relationship between the time spent awake in bed (TimeDiff) and the TotalSteps walked, so we cannot say these two statistics are related.
Now, I am curious whether it makes a difference to the total minutes recorded per day. The dataset does not have a TotalMin column, so I first create one by adding up the four minute columns (including SedentaryMinutes).
merged_data <- transform(merged_data, TotalMin=(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes))
merged_data <- merged_data %>% relocate(TotalMin, .before = TotalSteps)
ggplot(merged_data, mapping=aes(x = TimeDiff, y = TotalMin)) + geom_point()+ labs(title = "Relationship between Time Awake in Bed and Total Minutes") + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 530 rows containing missing values (`geom_point()`).
Although there appears to be a somewhat negative relationship (the more time spent awake in bed, the lower TotalMin is), there are a lot of outliers and the points are crowded in one place. However, if we look only at entries with less than 100 minutes awake in bed, the negative relationship is clearer.
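The observation about entries under 100 minutes can be checked numerically by restricting the data before computing the correlation. The sketch below shows the mechanics on a toy data frame, `sleep_activity`, with made-up values; it does not reproduce the actual result for merged_data.

```r
# Toy data (hypothetical values): a linear trend below 100 minutes plus two outliers
sleep_activity <- data.frame(
  TimeDiff = c(10, 20, 30, 40, 150, 300),
  TotalMin = c(1000, 950, 900, 850, 1100, 1200)
)

# Keep only entries with less than 100 minutes awake in bed
short_wake <- subset(sleep_activity, TimeDiff < 100)

# use = "complete.obs" drops rows with NA, which merged_data has for days without sleep records
cor_subset <- cor(short_wake$TimeDiff, short_wake$TotalMin, use = "complete.obs")
cor_all    <- cor(sleep_activity$TimeDiff, sleep_activity$TotalMin, use = "complete.obs")

c(subset = cor_subset, all = cor_all)
```

On the real data, the same `subset()` call on merged_data followed by `cor.test()` would show whether the relationship is actually stronger in that range.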
Now we compare the total minutes per weekday.
ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalMin, fill = Weekday)) + geom_col()+
theme(axis.text.x = element_blank()) + labs(title = "Total Minutes per Day")
This bar graph matches the first bar graph, which compares the number of entries per day, except that the smallest day is Monday and not Sunday. It also roughly matches the second bar graph, the one that compares TotalSteps per day, except that Saturday was the fourth most walked day there, while in this bar graph it is Friday. Perhaps longer walks explain the difference between the first 2 bar graphs.
This leads me to ask how all of this compares. That is, how do all these stats (TotalMin, TotalSteps, TotalOccurrence, TotalDistance) compare when viewed against the time spent awake in bed?
library(cowplot)
## Warning: package 'cowplot' was built under R version 4.3.3
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:lubridate':
##
## stamp
View(merged_data)
TotalMinGraph <- ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalMin, fill = Weekday)) + geom_col()+
theme(axis.text.x = element_blank()) + labs(title = "Total Minutes per Day", x = NULL)
TotalStepsGraph <- ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalSteps, fill = Weekday)) + geom_col()+
theme(axis.text.x = element_blank()) + labs(title = "Total Steps per Day", x = NULL)
TotalDistGraph <- ggplot(merged_data,mapping=aes(x= factor(Weekday, level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = TotalDistance, fill = Weekday)) + geom_col()+
theme(axis.text.x = element_blank()) + labs(title = "Total Distance per Day", x = NULL)
df <- data.frame(table(merged_data$Weekday))
TotalActGraph <- ggplot(df) + geom_col(mapping=aes(x = factor(Var1, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')), y = Freq, fill = Var1)) + labs(title = "Amount Activity Per Date", x = NULL) + theme(axis.text.x = element_blank())
plot_grid(TotalMinGraph, TotalStepsGraph, TotalDistGraph, TotalActGraph, labels = "AUTO")
Looking at the general order from largest to smallest, the Total Minutes graph and the Amount Activity graph match each other, while Total Steps and Total Distance match. These 4 graphs suggest that the users are most active, and the devices most used, on Tuesday, Wednesday, and Thursday. The 2 least active days are Sunday and Monday, while the other 2 days consistently remain somewhere in the middle. Since every bar is a total over all entries for that weekday, these rankings partly reflect how many entries each weekday has.
Now, this leads me to look at these metrics against the time spent awake in bed.
library(cowplot)
TimeDiff_n_TotalSteps <- ggplot(merged_data, mapping=aes(x = TimeDiff, y = TotalSteps)) + geom_point()+ labs(title = "Relationship between Time Awake in Bed and Total Steps") + geom_smooth()
TimeDiff_n_TotalDist <- ggplot(merged_data, mapping=aes(x = TimeDiff, y = TotalDistance)) + geom_point()+labs(title = "Relationship between Time Awake in Bed and Total Distance") + geom_smooth()
TimeDiff_n_TotalMin <- ggplot(merged_data, mapping=aes(x= TimeDiff, y = TotalMin)) + geom_point()+labs(title= "Relationship between Time Awake in Bed and Total Minutes") + geom_smooth()
cor.test(merged_data$TimeDiff, merged_data$TotalMin)
##
## Pearson's product-moment correlation
##
## data: merged_data$TimeDiff and merged_data$TotalMin
## t = -4.4746, df = 411, p-value = 9.925e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.305665 -0.121561
## sample estimates:
## cor
## -0.2155274
cor.test(merged_data$TimeDiff, merged_data$TotalDistance)
##
## Pearson's product-moment correlation
##
## data: merged_data$TimeDiff and merged_data$TotalDistance
## t = 0.12105, df = 411, p-value = 0.9037
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09057597 0.10240631
## sample estimates:
## cor
## 0.005970762
cor.test(merged_data$TimeDiff, merged_data$TotalSteps)
##
## Pearson's product-moment correlation
##
## data: merged_data$TimeDiff and merged_data$TotalSteps
## t = 0.54976, df = 411, p-value = 0.5828
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06956893 0.12327966
## sample estimates:
## cor
## 0.02710758
plot_grid(TimeDiff_n_TotalSteps, TimeDiff_n_TotalDist, TimeDiff_n_TotalMin, labels = "AUTO")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 530 rows containing missing values (`geom_point()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).
Looking at the graphs, we cannot deduce a clear relationship between the statistics. What was interesting to me was that while some entries had a total of 0 minutes, there were no entries with 0 steps or 0 distance. Perhaps this comes from the way the totals were calculated, as Total Minutes was the only one that I had to create and calculate myself (I included the sedentary minutes in my Total Minutes, and perhaps the other totals do not count sedentary activity in the same way).
Looking at these graphs, I think it would be a stretch to say there is a connection with the amount of time spent awake in bed, as there is no clear relationship. One thing we can say is that there is a weak negative relationship with TotalMin. The correlation between TimeDiff and TotalMin was -0.22, which is the largest in magnitude of the three and the only one with a small p-value (9.925e-06).
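To put the size of this correlation in perspective, the reported coefficient can be turned into a share of variance and checked against the reported t statistic. The sketch below only reuses the numbers printed by cor.test() above; the variable names are mine.

```r
r  <- -0.2155274   # correlation reported by cor.test() for TimeDiff vs TotalMin
df <- 411          # degrees of freedom reported by cor.test()

# Squared correlation: share of variance a straight-line fit would account for
r_squared <- r^2

# t statistic for a Pearson correlation: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

round(r_squared, 3)
round(t_stat, 2)
```

A squared correlation under 5% means the relationship, while statistically detectable with 413 paired observations, explains very little of the variation in TotalMin.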
Maybe the amount slept has a clearer relationship? This time I will do the same as above, but instead use the TotalMinutesAsleep.
TimeDiff_n_TotalSteps <- ggplot(merged_data, mapping=aes(x = TotalMinutesAsleep, y = TotalSteps)) + geom_point()+ labs(title = "Relationship between Total Minutes Asleep and Total Steps") + geom_smooth()
TimeDiff_n_TotalDist <- ggplot(merged_data, mapping=aes(x = TotalMinutesAsleep, y = TotalDistance)) + geom_point()+labs(title = "Relationship between Total Minutes Asleep and Total Distance") + geom_smooth()
TimeDiff_n_TotalMin <- ggplot(merged_data, mapping=aes(x= TotalMinutesAsleep, y = TotalMin)) + geom_point()+labs(title= "Relationship between Total Minutes Asleep and Total Minutes") + geom_smooth()
cor.test(merged_data$TotalMinutesAsleep, merged_data$TotalMin)
##
## Pearson's product-moment correlation
##
## data: merged_data$TotalMinutesAsleep and merged_data$TotalMin
## t = -16.456, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6850669 -0.5683003
## sample estimates:
## cor
## -0.6302341
cor.test(merged_data$TotalMinutesAsleep, merged_data$TotalDistance)
##
## Pearson's product-moment correlation
##
## data: merged_data$TotalMinutesAsleep and merged_data$TotalDistance
## t = -3.5428, df = 411, p-value = 0.0004414
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.26424789 -0.07692598
## sample estimates:
## cor
## -0.1721427
cor.test(merged_data$TotalMinutesAsleep, merged_data$TotalSteps)
##
## Pearson's product-moment correlation
##
## data: merged_data$TotalMinutesAsleep and merged_data$TotalSteps
## t = -3.8563, df = 411, p-value = 0.0001336
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27834209 -0.09203143
## sample estimates:
## cor
## -0.1868665
plot_grid(TimeDiff_n_TotalSteps, TimeDiff_n_TotalDist, TimeDiff_n_TotalMin, labels = "AUTO")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 530 rows containing missing values (`geom_point()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).
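Again, looking at this, TotalMinutesAsleep has only a weak negative relationship with TotalSteps and TotalDistance (correlations of about -0.19 and -0.17). However, there is a clear negative relationship between TotalMinutesAsleep and TotalMin (outside of a few outliers). At first this looks curious, as it suggests that the more the users slept, the fewer minutes they were active. A likely explanation is that TotalMin includes SedentaryMinutes, so it measures total tracked minutes rather than active minutes; since a day has only 1440 minutes, more time asleep leaves fewer minutes for everything else.
The correlation between TotalMinutesAsleep and TotalMin was -0.63, a moderately strong negative correlation.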
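Since TotalMin includes sedentary time, a separate active-minutes column may be a better measure of how active a day was. The sketch below builds both on a toy data frame, `day_minutes`; the first row uses the values shown in the glimpse() output above, the other rows are made up, and `ActiveMin` is a hypothetical column name.

```r
# Toy rows with the same minute columns as dailyActivity
# (row 1 from the glimpse() output above, rows 2-3 hypothetical)
day_minutes <- data.frame(
  VeryActiveMinutes    = c(25, 0, 60),
  FairlyActiveMinutes  = c(13, 5, 20),
  LightlyActiveMinutes = c(328, 100, 250),
  SedentaryMinutes     = c(728, 900, 600)
)

# Active minutes only: leaves out SedentaryMinutes
day_minutes$ActiveMin <- with(day_minutes,
  VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes)

# Tracked minutes: the TotalMin definition used in this analysis
day_minutes$TotalMin <- day_minutes$ActiveMin + day_minutes$SedentaryMinutes

day_minutes[, c("ActiveMin", "TotalMin")]
```

Repeating the correlation tests with such an active-minutes column in place of TotalMin would show whether sleep is related to activity itself or only to the total minutes tracked.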
Now I do a similar thing, but this time with TotalTimeInBed.
TimeDiff_n_TotalSteps <- ggplot(merged_data, mapping=aes(x = TotalTimeInBed, y = TotalSteps)) + geom_point()+ labs(title = "Relationship between Total Time in Bed and Total Steps") + geom_smooth()
TimeDiff_n_TotalDist <- ggplot(merged_data, mapping=aes(x = TotalTimeInBed, y = TotalDistance)) + geom_point()+labs(title = "Relationship between Total Time in Bed and Total Distance") + geom_smooth()
TimeDiff_n_TotalMin <- ggplot(merged_data, mapping=aes(x= TotalTimeInBed, y = TotalMin)) + geom_point()+labs(title= "Relationship between Total Time in Bed and Total Minutes") + geom_smooth()
cor.test(merged_data$TotalTimeInBed, merged_data$TotalMin)
##
## Pearson's product-moment correlation
##
## data: merged_data$TotalTimeInBed and merged_data$TotalMin
## t = -18.09, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7162610 -0.6083721
## sample estimates:
## cor
## -0.6657822
cor.test(merged_data$TotalTimeInBed, merged_data$TotalDistance)
##
## Pearson's product-moment correlation
##
## data: merged_data$TotalTimeInBed and merged_data$TotalDistance
## t = -3.2459, df = 411, p-value = 0.001267
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.25076397 -0.06255465
## sample estimates:
## cor
## -0.1580949
cor.test(merged_data$TotalTimeInBed, merged_data$TotalSteps)
##
## Pearson's product-moment correlation
##
## data: merged_data$TotalTimeInBed and merged_data$TotalSteps
## t = -3.3717, df = 411, p-value = 0.0008178
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.25649374 -0.06865199
## sample estimates:
## cor
## -0.1640597
plot_grid(TimeDiff_n_TotalSteps, TimeDiff_n_TotalDist, TimeDiff_n_TotalMin, labels = "AUTO")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 530 rows containing missing values (`geom_point()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (`stat_smooth()`).
## Removed 530 rows containing missing values (`geom_point()`).