This is the capstone project for the Google Data Analytics Certification.For this case study, I am tasked with assisting a wearable fitness technology company, Bellabeat, improve their marketing strategies for their products by investigating customer activity with other fitness trackers like FitBit.
My goal is to look at datasets to find out:
How are customers using other fitness trackers, in their daily life?
What particular features seem to be the most heavily used?
What features do Bellabeat products already have that consumers want, and how do we focus marketing on those aspects?
I will be using tidyverse package as well as the skimr, here, and janitor packages for help with this project.
We’re also using the sqldf package, which will allow us to emulate SQL syntax when looking at data
install.packages("sqldf",repos = "http://cran.us.r-project.org")
## package 'sqldf' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Asus\AppData\Local\Temp\RtmpiOaUyZ\downloaded_packages
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
The data that is provided is FitBit Fitness Tracker Data. This dataset has 18 different csv files that range from Daily activity, calories, steps; hourly calories, intensities, and steps; and heart rate, sleep data and weight logs. After looking at the types of data collected by these 30 fitbit users, i discovered: * No water intake data has been collected
The data frames I’ll be working with in this case study will be creating objects for:
daily_activity
daily calories
daily sleep
weight log info
daily intensities
daily_activity <- read.csv("C:/Users/Asus/OneDrive/Documents/fitbit_base/Fitabase_Data/dailyActivity_merged.csv")
daily_calories <- read.csv("C:/Users/Asus/OneDrive/Documents/fitbit_base/Fitabase_Data/dailyCalories_merged.csv")
sleep_day <- read.csv("C:/Users/Asus/OneDrive/Documents/fitbit_base/Fitabase_Data/sleepDay_merged.csv")
daily_intensities <- read.csv("C:/Users/Asus/OneDrive/Documents/fitbit_base/Fitabase_Data/dailyIntensities_merged.csv")
weight_log <- read.csv("C:/Users/Asus/OneDrive/Documents/fitbit_base/Fitabase_Data/weightLogInfo_merged.csv")
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
head(daily_calories)
## Id ActivityDay Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
colnames(daily_calories)
## [1] "Id" "ActivityDay" "Calories"
glimpse(daily_calories)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775~
head(sleep_day)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
glimpse(sleep_day)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~
head(daily_intensities)
## Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366 4/12/2016 728 328
## 2 1503960366 4/13/2016 776 217
## 3 1503960366 4/14/2016 1218 181
## 4 1503960366 4/15/2016 726 209
## 5 1503960366 4/16/2016 773 221
## 6 1503960366 4/17/2016 539 164
## FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
colnames(daily_intensities)
## [1] "Id" "ActivityDay"
## [3] "SedentaryMinutes" "LightlyActiveMinutes"
## [5] "FairlyActiveMinutes" "VeryActiveMinutes"
## [7] "SedentaryActiveDistance" "LightActiveDistance"
## [9] "ModeratelyActiveDistance" "VeryActiveDistance"
glimpse(daily_intensities)
## Rows: 940
## Columns: 10
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
head(weight_log)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
colnames(weight_log)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
glimpse(weight_log)
## Rows: 67
## Columns: 8
## $ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212~
## $ Date <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2~
## $ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, ~
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6~
## $ Fat <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,~
## $ IsManualReport <chr> "True", "True", "False", "True", "True", "True", "True"~
## $ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,~
After exploring these tables i observed a few things:
Merging of the data frames is possible because they all have same ‘ID’ fields
the daily_activity, daily_calories, and daily_intensities have the exact (940) number of observations.
Lastly the daily_activity table might have a log of calories and intensities already, so we should confirm that the values actually match for any given ‘ID’ number.
To confirm the last point I am running the following codes:
daily_activity2 <- daily_activity %>%
select(Id, ActivityDate, Calories)
head(daily_activity2)
## Id ActivityDate Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
sql_check1 <- sqldf('SELECT * FROM daily_activity2 INTERSECT SELECT * FROM daily_calories')
head(sql_check1)
## Id ActivityDate Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
nrow(sql_check1)
## [1] 940
From the above codes we can say that since the first six values of daily_activity and daily_calories are same and total observation of the sql query is 940 the values are the same between the dataframes.
My gut says, daily_intensities and daily_activities will also intersect but I would not create bias and will confirm it.
daily_activity3 <- daily_activity %>%
select(Id, ActivityDate, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, SedentaryActiveDistance, LightActiveDistance, ModeratelyActiveDistance, VeryActiveDistance)
head(daily_activity3)
## Id ActivityDate SedentaryMinutes LightlyActiveMinutes
## 1 1503960366 4/12/2016 728 328
## 2 1503960366 4/13/2016 776 217
## 3 1503960366 4/14/2016 1218 181
## 4 1503960366 4/15/2016 726 209
## 5 1503960366 4/16/2016 773 221
## 6 1503960366 4/17/2016 539 164
## FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
sql_check2 <- sqldf('SELECT * FROM daily_activity3 INTERSECT SELECT * FROM daily_intensities')
head(sql_check2)
## Id ActivityDate SedentaryMinutes LightlyActiveMinutes
## 1 1503960366 4/12/2016 728 328
## 2 1503960366 4/13/2016 776 217
## 3 1503960366 4/14/2016 1218 181
## 4 1503960366 4/15/2016 726 209
## 5 1503960366 4/16/2016 773 221
## 6 1503960366 4/17/2016 539 164
## FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
nrow(sql_check2)
## [1] 940
Seems like my gut was correct! but its always better to check, see and confirm with the data.
This means I can carry out my analysis with just the 3 different data frames: * daily_activity * sleep_day * weight_log
Since I have done my preparation and pre-processing. Now I will do the analysis
Since the number of observation in daily_activity is a lot more tha weight_log and sleep_day dataframes. There are chances there might be more id’s in the daily_activity dataframe. Lets confirm this.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight_log$Id)
## [1] 8
Lets check once again the total number of observations in each dataframe
nrow(daily_activity)
## [1] 940
nrow(sleep_day)
## [1] 413
nrow(weight_log)
## [1] 67
Quick summary of all the dataframes: * daily_activity
daily_activity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes,
VeryActiveMinutes) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes VeryActiveMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.: 0.00
## Median : 7406 Median : 5.245 Median :1057.5 Median : 4.00
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean : 21.16
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.: 32.00
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :210.00
sleep_day %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
weight_log %>%
select(WeightPounds,
BMI) %>%
summary()
## WeightPounds BMI
## Min. :116.0 Min. :21.45
## 1st Qu.:135.4 1st Qu.:23.96
## Median :137.8 Median :24.39
## Mean :158.8 Mean :25.19
## 3rd Qu.:187.5 3rd Qu.:25.56
## Max. :294.3 Max. :47.54
Just by seeing the heads and a quick summary one can’t see the full trend of the dataframes. So I will plot some graphs to see the relationship properly.
I would like to start with the relationship between steps taken in a da and sedentary(people were inactive) minutes
ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes, color = Calories)) + geom_point()
We can see there is a negative relation between total steps and sedentary minutes which is true also because one doesn’t move when he/she is inactive!
So we can easily market this to consumers by telling them smart-devices could help them start their journey by measuring how much they’re already moving!
The can also know about their sedentary time.
One can note that sedentary time is not necessarily related to calories burned.
Now I will plot the graph between calories and total steps to see the relationship between them.
ggplot(data=daily_activity, aes(x=TotalSteps, y = Calories))+ geom_point() + stat_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
We can clearly see people who took the most total steps tend to burn the most calories. But there is a lot of spread in the value.
Now lets look at the residual or the difference between the observed values and the estimated value
calories.lm <- lm(Calories ~ TotalSteps, data = daily_activity)
calories.res <- resid(calories.lm)
plot(daily_activity$TotalSteps, calories.res, ylab="Residuals",
xlab = "Total Steps", main = "Calories Burned")
abline(0,0)
#plot the density of the residuals
plot(density(calories.res))
#Checking for normality
qqnorm(calories.res)
qqline(calories.res)
So it looks like the spread isn’t as far statistically as we thought.
By seeing linear relationship in the graphs we can market that in order to burn calories we do not need to do high-intensity work out, one just needs to walk.
ggplot(data=sleep_day, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point()
As we can see, there are some outliers, some people that spent a lot of time in bed, but didn’t actually sleep, and then a small batch that slept a whole bunch and spent time in bed
We could definitely market to consumers to monitor their time in bed against their sleep time.
Since sleep time can be related to sedentary minutes I can merge these two data sets by the ID field to analyse further.
combined_sleep_day_data <- merge(sleep_day, daily_activity, by="Id")
head(combined_sleep_day_data)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/12/2016 12:00:00 AM 1 327
## 3 1503960366 4/12/2016 12:00:00 AM 1 327
## 4 1503960366 4/12/2016 12:00:00 AM 1 327
## 5 1503960366 4/12/2016 12:00:00 AM 1 327
## 6 1503960366 4/12/2016 12:00:00 AM 1 327
## TotalTimeInBed ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 346 5/7/2016 11992 7.71 7.71
## 2 346 5/6/2016 12159 8.03 8.03
## 3 346 5/1/2016 10602 6.81 6.81
## 4 346 4/30/2016 14673 9.25 9.25
## 5 346 4/12/2016 13162 8.50 8.50
## 6 346 4/13/2016 10735 6.97 6.97
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 2.46 2.12
## 2 0 1.97 0.25
## 3 0 2.29 1.60
## 4 0 3.56 1.42
## 5 0 1.88 0.55
## 6 0 1.57 0.69
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 3.13 0 37
## 2 5.81 0 24
## 3 2.92 0 33
## 4 4.27 0 52
## 5 6.06 0 25
## 6 4.71 0 21
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 46 175 833 1821
## 2 6 289 754 1896
## 3 35 246 730 1820
## 4 34 217 712 1947
## 5 13 328 728 1985
## 6 19 217 776 1797
Lets once again check the unique ID for the merged dataset.
n_distinct(combined_sleep_day_data$Id)
## [1] 24
Since we had only 24 unique ID for sleep_day dataset we only have those in the combined one because it did inner join.
To see the all the unique IDs from daily_activity dataset, i will use outer join.
combined_sleep_day_data2 <- merge(sleep_day, daily_activity, by="Id", all = TRUE)
head(combined_sleep_day_data2)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/12/2016 12:00:00 AM 1 327
## 3 1503960366 4/12/2016 12:00:00 AM 1 327
## 4 1503960366 4/12/2016 12:00:00 AM 1 327
## 5 1503960366 4/12/2016 12:00:00 AM 1 327
## 6 1503960366 4/12/2016 12:00:00 AM 1 327
## TotalTimeInBed ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 346 5/7/2016 11992 7.71 7.71
## 2 346 5/6/2016 12159 8.03 8.03
## 3 346 5/1/2016 10602 6.81 6.81
## 4 346 4/30/2016 14673 9.25 9.25
## 5 346 4/12/2016 13162 8.50 8.50
## 6 346 4/13/2016 10735 6.97 6.97
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 2.46 2.12
## 2 0 1.97 0.25
## 3 0 2.29 1.60
## 4 0 3.56 1.42
## 5 0 1.88 0.55
## 6 0 1.57 0.69
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 3.13 0 37
## 2 5.81 0 24
## 3 2.92 0 33
## 4 4.27 0 52
## 5 6.06 0 25
## 6 4.71 0 21
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 46 175 833 1821
## 2 6 289 754 1896
## 3 35 246 730 1820
## 4 34 217 712 1947
## 5 13 328 728 1985
## 6 19 217 776 1797
n_distinct(combined_sleep_day_data2$Id)
## [1] 33
Now we have all the 33 unique IDs
For this first plot we’ll try it out with only the 24 unique IDs that have actually logged sleep data.
Let’s run a correlation to see what the correlation coefficient coefficient would be for a linear regression:
sedentary.lm <- lm(SedentaryMinutes ~ TotalTimeInBed, data = combined_sleep_day_data)
sedentary.lm
##
## Call:
## lm(formula = SedentaryMinutes ~ TotalTimeInBed, data = combined_sleep_day_data)
##
## Coefficients:
## (Intercept) TotalTimeInBed
## 921.9598 -0.2678
By seeing this i can say they are not highly related as time in bed goes up, sedentary minutes actually go down, but not to a significant value since it is approx -0.3.
Lets check correlation between veryactiveminutes and calories
veryactive.lm <- lm(Calories ~ VeryActiveMinutes, data = combined_sleep_day_data)
veryactive.lm
##
## Call:
## lm(formula = Calories ~ VeryActiveMinutes, data = combined_sleep_day_data)
##
## Coefficients:
## (Intercept) VeryActiveMinutes
## 2004.36 13.55
One can see they are highly co related, so lets plot a graph.
ggplot(data = combined_sleep_day_data, aes(x=VeryActiveMinutes, y=Calories)) + geom_point() + stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Lets check correlation between total steps taken and calories
ggplot(data = combined_sleep_day_data, aes(x=TotalSteps, y=Calories)) + geom_point() +stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
lm(Calories ~ TotalSteps, data = combined_sleep_day_data)
##
## Call:
## lm(formula = Calories ~ TotalSteps, data = combined_sleep_day_data)
##
## Coefficients:
## (Intercept) TotalSteps
## 1.711e+03 7.616e-02
The value shows there is a very small correlation between total steps taken and calories burnt.
Lets check correlation between fairlyactivemiutes taken and calories.
lm(Calories ~ FairlyActiveMinutes, data = combined_sleep_day_data)
##
## Call:
## lm(formula = Calories ~ FairlyActiveMinutes, data = combined_sleep_day_data)
##
## Coefficients:
## (Intercept) FairlyActiveMinutes
## 2211.85 6.76
ggplot(data = combined_sleep_day_data, aes(x=FairlyActiveMinutes, y=Calories)) + geom_point() + stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
We can see there is a Moderate relationship for fairly active minutes.
I prepossessed, explored, analysed and visualized the fitbit users dataset quite deeply, and gave some marketing strategy above.
since in the scenario of the case study provided to me it was written Bellabeat collects hydration data because they had one product related to hydration check, they are better because fitbit did not have hydration data.
We can see that more people log their calories, steps taken, etc, and fewer users log their sleep data, and only a select few are logging their weight.
To market this, I initially thought that simply being active and taking steps would help with people on their journey start to burn calories. While this may be true, but the correlation between the two was small so after seeing the correlation maybe we shouldn’t market it that way.
I would focus on the fact that simply collecting more data from different competitors one could see more trends.
Also the best relationship was in between veryactiveminutes and calories so the people who are very active tend to burn the most calorie this can be a good marketing strategy.