Bellabeat is a fitness technology company that produces wearables for women to track their wellness activity. In this project we analyze fitness data to provide insights for improving Bellabeat's marketing strategy.

ASK: Identify trends in smart device usage.

Prepare and Process: We analyze the FitBit Tracker Data available on Kaggle, which contains data from 33 users. A sample this small can give skewed results. Ideally, as data analysts we would have asked the company for data on more users to obtain an adequate sample size and a less biased result. That is not possible here, so we proceed with the data available.

CLEANING AND ORGANIZING DATA

We are using four datasets: dailyActivity_merged, sleep_Day, weight_LogInfo and minute_METs.

Importing the datasets and loading the packages used in the analysis

dailyActivity_merged <- read.csv("C:/Users/PC/Desktop/mahwash/archive/Fitabase_B/dailyActivity_merged.csv")
sleep_Day <- read.csv("C:/Users/PC/Desktop/mahwash/archive/Fitabase_B/sleepDay_merged.csv")
weight_LogInfo <- read.csv("C:/Users/PC/Desktop/mahwash/archive/Fitabase_B/weightLogInfo_merged.csv")
minute_METs <- read.csv("C:/Users/PC/Desktop/mahwash/archive/Fitabase_B/minuteMETsNarrow_merged.csv")
library("ggplot2")
library("tidyr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
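
The absolute Windows paths above only work on the author's machine. A minimal sketch of a more portable pattern, assuming all CSV files sit in one folder: the folder is stored once in a variable (here the hypothetical name data_dir) and each path is built with file.path(). To keep the sketch runnable anywhere, it first writes a tiny stand-in CSV to a temporary folder; in the real project data_dir would point at the Fitabase folder instead. The helper name read_fitbit is also made up for illustration.

```r
# Stand-in for the real data folder (hypothetical; replace with your own path)
data_dir <- file.path(tempdir(), "Fitabase_demo")
dir.create(data_dir, showWarnings = FALSE)

# Write a two-row stand-in file so the example is self-contained
demo <- data.frame(Id = c(1503960366, 1503960366),
                   ActivityDate = c("4/12/2016", "4/13/2016"),
                   TotalSteps = c(13162L, 10735L))
write.csv(demo, file.path(data_dir, "dailyActivity_merged.csv"), row.names = FALSE)

# Illustrative helper that reads one CSV from data_dir
read_fitbit <- function(file_name) {
  read.csv(file.path(data_dir, file_name), stringsAsFactors = FALSE)
}

daily_demo <- read_fitbit("dailyActivity_merged.csv")
```

With this pattern only the data_dir line changes when the project moves to another machine.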

Checking Dataset structure

head(dailyActivity_merged)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
colnames(dailyActivity_merged)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

There are 15 variables (columns) in the dailyActivity_merged dataset.

str(dailyActivity_merged)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

Checking for NA values

anyNA(dailyActivity_merged)
## [1] FALSE

There are no NA values.

Checking total number of observations

nrow(dailyActivity_merged)
## [1] 940

There are 940 observations for 33 users.
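
The number of users can be checked directly from the Id column rather than taken from the dataset description. A minimal sketch on made-up rows (the toy data frame is illustrative, not the real data), using dplyr's n_distinct() and count(), plus a base R check for repeated Id and date pairs:

```r
library(dplyr)

# Toy stand-in for dailyActivity_merged
toy_activity <- data.frame(
  Id = c(1, 1, 1, 2, 2, 3),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016",
                   "4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 10460, 5000, 7000, 0)
)

# Number of distinct users
n_users <- n_distinct(toy_activity$Id)

# Number of logged days per user; uneven counts mean uneven usage
days_per_user <- toy_activity %>% count(Id, name = "days_logged")

# Rows that repeat an (Id, ActivityDate) pair already seen
n_duplicates <- sum(duplicated(toy_activity[, c("Id", "ActivityDate")]))
```

On the real data the same calls would be applied to dailyActivity_merged.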

head(sleep_Day)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
colnames(sleep_Day)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
str(sleep_Day)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
head(weight_LogInfo)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12
colnames(weight_LogInfo)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"
str(weight_LogInfo)
## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
head(minute_METs)
##           Id        ActivityMinute METs
## 1 1503960366 4/12/2016 12:00:00 AM   10
## 2 1503960366 4/12/2016 12:01:00 AM   10
## 3 1503960366 4/12/2016 12:02:00 AM   10
## 4 1503960366 4/12/2016 12:03:00 AM   10
## 5 1503960366 4/12/2016 12:04:00 AM   10
## 6 1503960366 4/12/2016 12:05:00 AM   12
colnames(minute_METs)
## [1] "Id"             "ActivityMinute" "METs"
str(minute_METs)
## 'data.frame':    1325580 obs. of  3 variables:
##  $ Id            : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityMinute: chr  "4/12/2016 12:00:00 AM" "4/12/2016 12:01:00 AM" "4/12/2016 12:02:00 AM" "4/12/2016 12:03:00 AM" ...
##  $ METs          : int  10 10 10 10 10 12 12 12 12 12 ...
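
The str() output shows every date column stored as chr. A hedged sketch of converting such values to proper date types with base R, assuming the month/day/year and 12-hour AM/PM formats visible in the head() output hold for every row (parsing %p also assumes an English-style locale):

```r
# Values copied from the head() output above
activity_date <- c("4/12/2016", "4/13/2016")
sleep_day     <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Date-only column
activity_date_parsed <- as.Date(activity_date, format = "%m/%d/%Y")

# Date-time column, then reduced to a Date so it can be matched to ActivityDate
sleep_time_parsed <- as.POSIXct(sleep_day,
                                format = "%m/%d/%Y %I:%M:%S %p",
                                tz = "UTC")
sleep_date_parsed <- as.Date(sleep_time_parsed)
```

Real dates sort and plot chronologically, which character strings such as "4/12/2016" do not.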

Analyzing relationship between total steps taken and calories

ggplot(data = dailyActivity_merged, aes(x=TotalSteps, y=Calories)) +
  geom_point() +
  stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

Correlation between total steps and calories

cor(dailyActivity_merged$TotalSteps, dailyActivity_merged$Calories, method = "pearson")
## [1] 0.5915681
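
cor() returns only the coefficient. As a sketch of how one might also obtain a confidence interval and a p-value, base R's cor.test() can be run on the same two columns. The data below are simulated purely so the example runs on its own (the slope and noise level are arbitrary assumptions), so the numbers it produces say nothing about the Fitbit data.

```r
set.seed(42)

# Simulated stand-ins for TotalSteps and Calories
toy_steps <- seq(1000, 20000, by = 1000)
toy_calories <- 1500 + 0.05 * toy_steps + rnorm(length(toy_steps), sd = 200)

# Pearson correlation with a significance test
steps_test <- cor.test(toy_steps, toy_calories, method = "pearson")

r_estimate <- unname(steps_test$estimate)  # correlation coefficient
r_conf_int <- steps_test$conf.int          # 95% confidence interval
p_value    <- steps_test$p.value           # test of r = 0
```

On the real data the call would be cor.test(dailyActivity_merged$TotalSteps, dailyActivity_merged$Calories). The 940 rows are repeated measurements on 33 users, so they are not independent and the p-value should be read with caution.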

Relationship between sedentary minutes and calories

ggplot(data = dailyActivity_merged, aes(x=SedentaryMinutes, y=Calories)) +
  geom_point() +
  stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

Correlation between sedentary minutes and calories

cor(dailyActivity_merged$SedentaryMinutes, dailyActivity_merged$Calories, method = "pearson")
## [1] -0.106973

Relationship between Very active minutes and total duration of sleep

We need to merge the dailyActivity_merged and sleep_Day datasets. First we split the SleepDay column of sleep_Day into separate date and time columns, so that the date matches the ActivityDate column. We then use left_join, which keeps every row of dailyActivity_merged even though sleep_Day has fewer rows (413 versus 940).

Sleep_day1 <- separate(sleep_Day, SleepDay, into = c("ActivityDate", "time"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Sleep_day2 <- Sleep_day1 %>%
  select(-time)
dailyact_sleep <- left_join(dailyActivity_merged, Sleep_day2)
## Joining, by = c("Id", "ActivityDate")
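
The warning and the "Joining, by" message above can both be avoided by being explicit. A minimal sketch on toy data (the toy_ data frames are made up for illustration): tidyr's separate() takes extra = "drop" to discard the trailing AM/PM piece silently, and left_join() takes a by argument naming the key columns.

```r
library(dplyr)
library(tidyr)

toy_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  VeryActiveMinutes = c(25, 21, 0)
)
toy_sleep <- data.frame(
  Id = c(1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 410)
)

# extra = "drop" discards the third piece ("AM") without a warning
toy_sleep_split <- toy_sleep %>%
  separate(SleepDay, into = c("ActivityDate", "time"),
           sep = " ", extra = "drop") %>%
  select(-time)

# Explicit keys: every activity row is kept, unmatched days get NA
toy_joined <- left_join(toy_activity, toy_sleep_split,
                        by = c("Id", "ActivityDate"))

# Days with activity but no sleep record
unmatched_days <- sum(is.na(toy_joined$TotalMinutesAsleep))
```

The NA rows produced by the join are the days without a sleep record, which is why the NA check that follows returns TRUE.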

Checking for any NA values

anyNA(dailyact_sleep)
## [1] TRUE

Dropping NA values

dailyact_sleep1 <- dailyact_sleep %>%
  select(VeryActiveMinutes, TotalMinutesAsleep) %>%
  filter(!is.na(VeryActiveMinutes), !is.na(TotalMinutesAsleep))
ggplot(data = dailyact_sleep1, aes(x=VeryActiveMinutes, y=TotalMinutesAsleep)) +
  geom_point() +
  stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

Correlation between very active minutes and total duration of sleep

cor(dailyact_sleep1$VeryActiveMinutes, dailyact_sleep1$TotalMinutesAsleep, method = "pearson")
## [1] -0.09043628

Relationship between METs and BMI

We first separate the ActivityMinute column of the minute_METs dataset into Date and time columns, and do the same for the Date column of the weight_LogInfo dataset. We then merge the two datasets with left_join.

met_act <- separate(minute_METs, ActivityMinute, into = c("Date", "time"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 1325580 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
weight_log <- separate(weight_LogInfo,Date, into = c("Date", "time"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
weight_log1 <- weight_log %>%
  select(-time)
met_weight <- left_join(met_act, weight_log1)
## Joining, by = c("Id", "Date")
met_weight1 <- met_weight %>%
  select(METs, BMI) %>%
  filter(!is.na(METs), !is.na(BMI))
ggplot(data = met_weight1, aes(x=METs, y=BMI)) +
  geom_point() +
  stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
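
Because minute_METs has one row per minute while weight_LogInfo has at most one row per user and day, the join above repeats each BMI value for every minute of that day. One possible alternative, sketched here on toy data, is to average METs per user and day first so that each weight log contributes a single point. The column name mean_METs is introduced for illustration. The division by 10 assumes that Fitabase exports MET values multiplied by 10; this should be checked against the dataset's data dictionary before use.

```r
library(dplyr)

# Toy stand-ins for the date-split METs and weight tables
toy_mets <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  Date = c("5/2/2016", "5/2/2016", "5/2/2016", "4/13/2016", "4/13/2016"),
  METs = c(10, 10, 40, 12, 18)
)
toy_weight <- data.frame(
  Id = c(1, 2),
  Date = c("5/2/2016", "4/13/2016"),
  BMI = c(22.65, 47.54)
)

# One row per user and day; /10 is the assumed export scaling
daily_mets <- toy_mets %>%
  group_by(Id, Date) %>%
  summarise(mean_METs = mean(METs) / 10, .groups = "drop")

# Keep only the days that have a weight log
daily_mets_bmi <- inner_join(daily_mets, toy_weight, by = c("Id", "Date"))
```

A correlation computed on a table like daily_mets_bmi gives each weight log equal weight instead of weighting it by the number of minutes recorded that day.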

Correlation between METs and BMI

cor(met_weight1$METs, met_weight1$BMI, method = "pearson")
## [1] -0.02574604
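
The four coefficients reported in this analysis are easier to compare once squared, since r squared is the share of variance in one variable explained by a straight-line fit on the other. A short sketch using the values printed above (the vector names are illustrative):

```r
# Pearson coefficients copied from the cor() outputs above
r_values <- c(
  steps_calories     = 0.5915681,
  sedentary_calories = -0.106973,
  active_sleep       = -0.09043628,
  mets_bmi           = -0.02574604
)

# Share of variance explained by a linear fit, in percent
r_squared_pct <- round(100 * r_values^2, 2)
```

Total steps account for roughly 35% of the variation in calories, while each of the other three pairs accounts for about 1% or less.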

Insights

There is a moderate positive relationship between total steps taken and calories (r = 0.59). We could modify our Leaf gadget to send a daily notification when the user has not walked that day.

There is a weak negative relationship between sedentary minutes and calories (r = -0.11). We could improve our gadget to send reminders to be more active when sedentary minutes increase.

The relationship between very active minutes and total duration of sleep is also negative, but very weak (r = -0.09): days with more sleep tend to have slightly fewer very active minutes. The gadget could recommend how many hours an individual needs to sleep without compromising on his or her very active minutes.

METs and BMI are also negatively related, although the correlation is close to zero (r = -0.03). A personalized recommendation could be made available in the gadget on how to divide activity in terms of METs to help regulate BMI.