Bellabeat is a fitness technology company that produces wearables for women to track their wellness activity. In this project we analyze fitness data to provide insights for improving Bellabeat’s marketing strategy.
ASK: Identify trends in smart device usage.
Prepare and Process: We analyze the FitBit Tracker Data available on Kaggle, which covers 33 users. A sample this small can give skewed results. Ideally, as data analysts, we would have asked the company for data on more users to obtain an adequate sample size and a less biased result. That is not possible here, so we proceed with the data available.
CLEANING AND ORGANIZING DATA
We are using four datasets: dailyActivity_merged, sleep_Day, weight_LogInfo and minute_METs.
dailyActivity_merged <- read.csv("C:/Users/PC/Desktop/mahwash/archive/Fitabase_B/dailyActivity_merged.csv")
sleep_Day <- read.csv("C:/Users/PC/Desktop/mahwash/archive/Fitabase_B/sleepDay_merged.csv")
weight_LogInfo <- read.csv("C:/Users/PC/Desktop/mahwash/archive/Fitabase_B/weightLogInfo_merged.csv")
minute_METs <- read.csv("C:/Users/PC/Desktop/mahwash/archive/Fitabase_B/minuteMETsNarrow_merged.csv")
library("ggplot2")
library("tidyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
head(dailyActivity_merged)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
colnames(dailyActivity_merged)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
There are 15 variables (columns) in the dailyActivity_merged dataset.
str(dailyActivity_merged)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
anyNA(dailyActivity_merged)
## [1] FALSE
The dataset contains no NA values.
nrow(dailyActivity_merged)
## [1] 940
There are 940 observations for 33 users.
head(sleep_Day)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
colnames(sleep_Day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
str(sleep_Day)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
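Both `ActivityDate` and `SleepDay` are stored as character strings, with `SleepDay` carrying a midnight timestamp. One possible way to bring them to a common type is to parse both into `Date` values. The following is a minimal sketch on a few strings copied from the outputs above; the variable names are introduced here for illustration only.

```r
# Example strings copied from the str() outputs above
activity_date <- c("4/12/2016", "4/13/2016")
sleep_day     <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# as.Date() reads only as much of the string as the format needs,
# so the trailing "12:00:00 AM" is ignored
activity_parsed <- as.Date(activity_date, format = "%m/%d/%Y")
sleep_parsed    <- as.Date(sleep_day, format = "%m/%d/%Y")

print(activity_parsed)
print(identical(activity_parsed, sleep_parsed))
```

Parsed dates sort chronologically and can be used directly as join keys, which plain "m/d/Y" strings cannot guarantee.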
head(weight_LogInfo)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
colnames(weight_LogInfo)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
str(weight_LogInfo)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
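The `str()` output shows NA values in the `Fat` column, and the first six rows come from only a few users. A minimal sketch of how one might count distinct users and the share of missing values per column, using only the six rows printed by `head(weight_LogInfo)` above (the name `weight_head` is introduced here for illustration):

```r
# Six rows copied from the head(weight_LogInfo) output; `weight_head` is an illustrative name
weight_head <- data.frame(
  Id  = c(1503960366, 1503960366, 1927972279, 2873212765, 2873212765, 4319703577),
  Fat = c(22, NA, NA, NA, NA, 25),
  BMI = c(22.65, 22.65, 47.54, 21.45, 21.69, 27.45)
)

# Number of distinct users in these rows
n_users <- length(unique(weight_head$Id))

# Share of missing values in each column
na_share <- colMeans(is.na(weight_head))

print(n_users)
print(na_share)
```

Applied to the full `weight_LogInfo` data frame, the same two calls would show how many users logged their weight at all and how sparse the `Fat` column is.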
head(minute_METs)
## Id ActivityMinute METs
## 1 1503960366 4/12/2016 12:00:00 AM 10
## 2 1503960366 4/12/2016 12:01:00 AM 10
## 3 1503960366 4/12/2016 12:02:00 AM 10
## 4 1503960366 4/12/2016 12:03:00 AM 10
## 5 1503960366 4/12/2016 12:04:00 AM 10
## 6 1503960366 4/12/2016 12:05:00 AM 12
colnames(minute_METs)
## [1] "Id" "ActivityMinute" "METs"
str(minute_METs)
## 'data.frame': 1325580 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityMinute: chr "4/12/2016 12:00:00 AM" "4/12/2016 12:01:00 AM" "4/12/2016 12:02:00 AM" "4/12/2016 12:03:00 AM" ...
## $ METs : int 10 10 10 10 10 12 12 12 12 12 ...
ggplot(data = dailyActivity_merged, aes(x=TotalSteps, y=Calories)) +
geom_point() +
stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
cor(dailyActivity_merged$TotalSteps, dailyActivity_merged$Calories, method = "pearson")
## [1] 0.5915681
ggplot(data = dailyActivity_merged, aes(x=SedentaryMinutes, y=Calories)) +
geom_point() +
stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
## Correlation between Sedentary minutes and calories
cor(dailyActivity_merged$SedentaryMinutes, dailyActivity_merged$Calories, method = "pearson")
## [1] -0.106973
Sleep_day1 <- separate(sleep_Day, SleepDay, into = c("ActivityDate", "time"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Sleep_day2 <- Sleep_day1 %>%
select(-time)
dailyact_sleep <- left_join(dailyActivity_merged, Sleep_day2)
## Joining, by = c("Id", "ActivityDate")
anyNA(dailyact_sleep)
## [1] TRUE
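The joined table contains NA values because `left_join` keeps every activity row, including days with no sleep record. A small sketch using dates and values copied from the `head()` outputs above for one user (the sleep data has no entry for 4/14/2016); the data frame names are illustrative:

```r
library(dplyr)

# Rows copied from the head() outputs above for user 1503960366
activity <- data.frame(
  Id = 1503960366,
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016"),
  VeryActiveMinutes = c(25, 21, 30, 29)
)
sleep <- data.frame(
  Id = 1503960366,
  ActivityDate = c("4/12/2016", "4/13/2016", "4/15/2016"),
  TotalMinutesAsleep = c(327, 384, 412)
)

# left_join keeps every activity row; days without a sleep record get NA
left_result <- left_join(activity, sleep, by = c("Id", "ActivityDate"))

# inner_join keeps only days present in both tables
inner_result <- inner_join(activity, sleep, by = c("Id", "ActivityDate"))

print(left_result)
print(nrow(inner_result))
```

Either approach works for the correlation that follows, as long as rows with missing sleep values are removed before calling `cor()`.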
dailyact_sleep1 <- dailyact_sleep %>%
select(VeryActiveMinutes, TotalMinutesAsleep) %>%
filter(!is.na(VeryActiveMinutes), !is.na(TotalMinutesAsleep))
ggplot(data = dailyact_sleep1, aes(x=VeryActiveMinutes, y=TotalMinutesAsleep)) +
geom_point() +
stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
cor(dailyact_sleep1$VeryActiveMinutes, dailyact_sleep1$TotalMinutesAsleep, method = "pearson")
## [1] -0.09043628
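We first split the ActivityMinute column of the minute_METs dataset into separate Date and time columns, and do the same for the Date column of weight_LogInfo. We then use left_join to merge the two datasets on Id and Date. Because METs are recorded per minute while weight is logged far less often, each BMI value is repeated across every minute of its matching day.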
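The splitting step can be sketched on two timestamps copied from the `head(minute_METs)` output. Splitting on a space yields three pieces (date, clock time, AM/PM), which is why `separate` warns about discarded pieces. The `extra = "drop"` argument used here is an addition for illustration and is not part of the original code; it discards the surplus piece silently.

```r
library(tidyr)

# Timestamps copied from the head(minute_METs) output; `mets_head` is an illustrative name
mets_head <- data.frame(
  Id = 1503960366,
  ActivityMinute = c("4/12/2016 12:00:00 AM", "4/12/2016 12:05:00 AM"),
  METs = c(10, 12)
)

# Splitting on " " yields three pieces (date, clock time, AM/PM).
# extra = "drop" discards the third piece without a warning.
mets_split <- separate(mets_head, ActivityMinute,
                       into = c("Date", "time"), sep = " ", extra = "drop")

print(mets_split)
```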
met_act <- separate(minute_METs, ActivityMinute, into = c("Date", "time"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 1325580 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
weight_log <- separate(weight_LogInfo,Date, into = c("Date", "time"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
weight_log1 <- weight_log %>%
select(-time)
met_weight <- left_join(met_act, weight_log1)
## Joining, by = c("Id", "Date")
met_weight1 <- met_weight %>%
select(METs, BMI) %>%
filter(!is.na(METs), !is.na(BMI))
ggplot(data = met_weight1, aes(x=METs, y=BMI)) +
geom_point() +
stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
cor(met_weight1$METs, met_weight1$BMI, method = "pearson")
## [1] -0.02574604
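To compare the strength of the four relationships, one option is to square each Pearson coefficient, which gives the share of variance a simple linear fit would explain. The sketch below uses the coefficients reported by the `cor()` calls above; the vector names are introduced here for illustration.

```r
# Pearson coefficients copied from the cor() outputs above
r <- c(
  steps_calories     = 0.5915681,
  sedentary_calories = -0.106973,
  veryactive_sleep   = -0.09043628,
  mets_bmi           = -0.02574604
)

# Squared correlation: share of variance explained by a simple linear fit
r_squared <- round(r^2, 3)
print(r_squared)
```

By this measure, steps account for roughly a third of the variation in calories, while each of the other three pairs explains about 1% or less.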
Insights
There is a moderate positive relationship between total steps and calories (r = 0.59). We could therefore modify the Leaf gadget to send a daily notification when the user has not walked that day.
There is a weak negative relationship between sedentary minutes and calories (r = -0.11). The gadget could send daily reminders to be more active when sedentary minutes increase.
The relationship between very active minutes and total sleep duration is also negative but very weak (r = -0.09): in this sample, days with more sleep tend to have slightly fewer very active minutes. The gadget could recommend how many hours an individual needs to sleep without compromising his or her very active minutes.
METs and BMI are also negatively related, although the correlation is close to zero (r = -0.03). A personalized recommendation could be made available in the gadget on how to divide activity in terms of METs to regulate BMI.