This is the capstone project for the Google Data Analytics Certification. For this case study, we are tasked with helping Bellabeat, a wearable fitness technology company, improve the marketing strategy for its products by investigating how customers use other fitness trackers such as Fitbit.
Our goals are to look at datasets to find out:
How are customers using other fitness trackers, like Fitbit, in their daily lives?
What particular features seem to be the most heavily used?
What features do Bellabeat products already have that consumers want, and how do we focus marketing on those aspects?
What features should Bellabeat products consider adding to entice more customers?
The first dataset we’ll be looking at comes from here: https://www.kaggle.com/arashnic/fitbit
There are a number of different CSV files, covering daily activity, calories, and steps; hourly calories, intensities, and steps; and heart rate, sleep data, and weight logs.
A few things immediately come to mind when looking at the types of data collected from these 30 or so Fitbit users (we count the distinct IDs later on).
No water intake data has been collected.
These data may not actually answer our questions, but exploration will tell.
Data will be stored in our documents folder, which will serve as our working directory for the project.
We’ll be using the tidyverse package, along with the skimr, here, and janitor packages, for this project.
We’re also using the sqldf package, which lets us run SQL queries against our data frames.
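The package setup isn’t shown in this write-up, so here is a minimal sketch of how it might look. It assumes the five packages named above are already installed; the `pkgs`, `installed`, and `missing_pkgs` names are just for illustration.

```r
# Packages named in the text; installation is assumed to have been done already.
pkgs <- c("tidyverse", "skimr", "here", "janitor", "sqldf")

# Check which packages are available instead of failing halfway through.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach everything so functions like glimpse() and sqldf() are available.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```

Checking up front means a missing package shows up as one clear message rather than as an error in the middle of the analysis.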
Here we’ll create our data frames for review. We’ll be creating objects for:
daily_activity
daily_calories
sleep_day
weight_log
daily_intensities
We’ll follow typical naming conventions based on the CSV file names.
daily_activity <- read.csv("dailyActivity_merged.csv")
daily_calories <- read.csv("dailyCalories_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")
daily_intensities <- read.csv("dailyIntensities_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")
Let’s take a beat to investigate the tables. For each one we’ll look at the first six rows using the head() function and the column names with the colnames() function.
Since we’re looking at 5 different data frames, it might be a bit overwhelming to look at all of them at once, but it’s critical to get a look at each of the tables now.
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
head(daily_calories)
## Id ActivityDay Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
colnames(daily_calories)
## [1] "Id" "ActivityDay" "Calories"
head(sleep_day)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(daily_intensities)
## Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366 4/12/2016 728 328
## 2 1503960366 4/13/2016 776 217
## 3 1503960366 4/14/2016 1218 181
## 4 1503960366 4/15/2016 726 209
## 5 1503960366 4/16/2016 773 221
## 6 1503960366 4/17/2016 539 164
## FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
colnames(daily_intensities)
## [1] "Id" "ActivityDay"
## [3] "SedentaryMinutes" "LightlyActiveMinutes"
## [5] "FairlyActiveMinutes" "VeryActiveMinutes"
## [7] "SedentaryActiveDistance" "LightActiveDistance"
## [9] "ModeratelyActiveDistance" "VeryActiveDistance"
The weight log includes an IsManualReport field that flags manual entries, and it has far fewer observations than the other tables. This suggests only a small subset of users actually log their weight. It also raises the question: how does the Fitbit record your weight in the first place?
head(weight_log)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
colnames(weight_log)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
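One way to start on that question is to count how many entries are flagged as manual. The sketch below uses only the six IsManualReport values printed above, so the share it reports describes those six rows, not the full table. Reading "False" as a reading synced from a connected scale is an assumption on my part.

```r
# First six IsManualReport values as printed by head(weight_log).
# On the real data this would be weight_log$IsManualReport.
is_manual <- c("True", "True", "False", "True", "True", "True")

# Count manual vs. non-manual entries and compute the manual share.
manual_counts <- table(is_manual)
manual_share <- mean(is_manual == "True")

print(manual_counts)
print(manual_share)
```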
All 5 data frames have the same ‘Id’ field, so we can merge the datasets if need be.
It looks like daily_activity, daily_calories, and daily_intensities have exactly the same number of observations.
Furthermore, it seems the daily_activity table might already contain the calories and intensities, so we should confirm that the values actually match for any given ‘Id’ and date.
Let’s use SQL syntax to see whether the rows in daily_calories also appear in daily_activity. This won’t work if the two data frames have a different number of columns, so we’ll need to create a temporary data frame that selects only the matching columns from daily_activity. Let’s just call it “daily_activity2”.
daily_activity2 <- daily_activity %>%
  select(Id, ActivityDate, Calories)
head(daily_activity2)
## Id ActivityDate Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
Great, now let’s see what’s the same between the two data frames of daily_activity2 and daily_calories
sql_check1 <- sqldf('SELECT * FROM daily_activity2 INTERSECT SELECT * FROM daily_calories')
head(sql_check1)
## Id ActivityDate Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
nrow(sql_check1)
## [1] 940
The query returned 940 rows, the same as the number of rows in each data frame, so the values match between the two.
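Two details make this check work. INTERSECT compares columns by position, which is why the differing names ActivityDate and ActivityDay didn’t matter, and it returns distinct rows only. The same comparison can be sketched in base R, here on three toy rows copied from the output above rather than the full tables; the `activity_subset`, `calories_tbl`, and `tables_agree` names are made up for this example.

```r
# Toy stand-ins built from rows printed above (not the full 940-row tables).
activity_subset <- data.frame(
  Id = rep(1503960366, 3),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985L, 1797L, 1776L)
)
calories_tbl <- data.frame(
  Id = rep(1503960366, 3),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985L, 1797L, 1776L)
)

# Align the column names, then join on every column: only exact matches survive.
names(calories_tbl) <- names(activity_subset)
matched <- merge(activity_subset, calories_tbl)

# No duplicate rows plus a full-size match means the tables agree row for row.
no_dupes <- !anyDuplicated(activity_subset) && !anyDuplicated(calories_tbl)
tables_agree <- no_dupes &&
  nrow(matched) == nrow(activity_subset) &&
  nrow(matched) == nrow(calories_tbl)
print(tables_agree)
```

The duplicate check matters because a row count alone could hide repeated rows on one side.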
This leads us to expect the same of daily_intensities, in which case we can drop both data frames from the analysis. Just for completeness’ sake, let’s run the same check again :)
We will have to create another temporary data frame since daily_intensities only has 10 variables.
daily_activity3 <- daily_activity %>%
  select(Id, ActivityDate,
         SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes,
         SedentaryActiveDistance, LightActiveDistance, ModeratelyActiveDistance, VeryActiveDistance)
head(daily_activity3)
## Id ActivityDate SedentaryMinutes LightlyActiveMinutes
## 1 1503960366 4/12/2016 728 328
## 2 1503960366 4/13/2016 776 217
## 3 1503960366 4/14/2016 1218 181
## 4 1503960366 4/15/2016 726 209
## 5 1503960366 4/16/2016 773 221
## 6 1503960366 4/17/2016 539 164
## FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
Great, now let’s see what’s the same between the two data frames, daily_activity3 and daily_intensities.
sql_check2 <- sqldf('SELECT * FROM daily_activity3 INTERSECT SELECT * FROM daily_intensities')
head(sql_check2)
## Id ActivityDate SedentaryMinutes LightlyActiveMinutes
## 1 1503960366 4/12/2016 728 328
## 2 1503960366 4/13/2016 776 217
## 3 1503960366 4/14/2016 1218 181
## 4 1503960366 4/15/2016 726 209
## 5 1503960366 4/16/2016 773 221
## 6 1503960366 4/17/2016 539 164
## FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
nrow(sql_check2)
## [1] 940
Looks like it’s that magical 940 observations again, so we can officially remove those two datasets from the analysis. Hooray for exploration!!
That leaves us with 3 data frames:
daily_activity
sleep_day
weight_log
First, it looks like there may be more IDs in the daily_activity data frame than in the sleep_day data frame, and fewer still in the weight_log data frame, so let’s use the n_distinct() function to find out.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight_log$Id)
## [1] 8
How many observations are there in each dataframe?
nrow(daily_activity)
## [1] 940
nrow(sleep_day)
## [1] 413
nrow(weight_log)
## [1] 67
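Dividing the row counts by the user counts gives a rough sense of how often each feature is logged per user. This is a quick sketch using only the numbers printed above; the `users`, `rows`, and `records_per_user` names are illustrative.

```r
# Counts copied from the n_distinct() and nrow() output above.
users <- c(daily_activity = 33, sleep_day = 24, weight_log = 8)
rows  <- c(daily_activity = 940, sleep_day = 413, weight_log = 67)

# Average number of records per user in each table.
records_per_user <- round(rows / users, 1)
print(records_per_user)
```

That works out to roughly 28.5 activity records, 17.2 sleep records, and 8.4 weight entries per user who logged that feature at all.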
What are some quick summary statistics we’d want to know about each data frame?
For the daily activity dataframe:
daily_activity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes,
VeryActiveMinutes) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes VeryActiveMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.: 0.00
## Median : 7406 Median : 5.245 Median :1057.5 Median : 4.00
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean : 21.16
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.: 32.00
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :210.00
For the sleep dataframe:
sleep_day %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
For the weight dataframe:
weight_log %>%
select(WeightPounds,
BMI) %>%
summary()
## WeightPounds BMI
## Min. :116.0 Min. :21.45
## 1st Qu.:135.4 1st Qu.:23.96
## Median :137.8 Median :24.39
## Mean :158.8 Mean :25.19
## 3rd Qu.:187.5 3rd Qu.:25.56
## Max. :294.3 Max. :47.54
What’s the relationship between steps taken in a day and sedentary minutes? It seems that we have a negative relationship between total steps taken and the minutes someone has remained sedentary. We also see that calories generally trend positively with total steps taken.
This suggests the data are internally consistent when it comes to recording steps and sedentary minutes. We could easily market this to consumers by telling them smart devices could help them start their journey by measuring how much they’re already moving!
You could also market the devices as a way to let people know how sedentary they actually are, and how active they could be. Not sure if I find that ethical, but hey, fear sells!
We can also note that sedentary time is not necessarily related to calories burned.
ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes, color = Calories)) + geom_point()
Let’s quickly plot calories against total steps on their own to get rid of some of this noise.
ggplot(data=daily_activity, aes(x=TotalSteps, y = Calories))+ geom_point() + stat_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
It seems pretty clear from the above graph that, in general, the people who took the most total steps tended to burn the most calories; however, there’s a large spread, with most points clustered toward the lower step counts.
Let’s take a look at the residuals, or the differences between the observed values and the estimated values.
calories.lm <- lm(Calories ~ TotalSteps, data = daily_activity)
calories.res <- resid(calories.lm)
#This seems kinda messy
plot(daily_activity$TotalSteps, calories.res, ylab="Residuals",
xlab = "Total Steps", main = "Calories Burned")
abline(0,0)
#plot the density of the residuals
plot(density(calories.res))
#Checking for normality
qqnorm(calories.res)
qqline(calories.res)
So it looks like the spread isn’t as wide, statistically, as we thought. This is good news!
Here you could easily market that in order to burn calories, you don’t have to do high-intensity workouts; you can just get out there and start walking!
This would be such a relief as a consumer, because it suggests that you can get results without starting a gym membership or a large workout routine. You can burn calories simply by walking.
Let’s look at our sleep data. We should see a nearly 1:1 trend between the amount of time slept and the total time someone spends in bed.
ggplot(data=sleep_day, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point()
As we can see, there are some outliers here! Some data points show a lot of time in bed without much actual sleep, and a small batch slept a whole bunch and spent extra time in bed (I can relate).
We could definitely market the watch to consumers as a way to monitor their time in bed against their sleep time.
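To make that comparison concrete, we can compute the minutes spent in bed but not asleep, and the share of bed time spent sleeping. The sketch below uses the six rows printed earlier by head(sleep_day); the MinutesAwakeInBed and SleepEfficiency column names are made up for this example and are not part of the dataset.

```r
# First six sleep_day rows as printed earlier by head(sleep_day).
sleep_sample <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Minutes in bed but not asleep.
sleep_sample$MinutesAwakeInBed <-
  sleep_sample$TotalTimeInBed - sleep_sample$TotalMinutesAsleep

# Share of time in bed actually spent asleep (1 would be a perfect 1:1 night).
sleep_sample$SleepEfficiency <-
  round(sleep_sample$TotalMinutesAsleep / sleep_sample$TotalTimeInBed, 3)

print(sleep_sample)
```

For these six nights the gap runs from 12 to 30 minutes, which is the kind of number a user could actually act on.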
I wonder how this relates to the sedentary minutes data in the daily activity dataset?
combined_sleep_day_data <- merge(sleep_day, daily_activity, by="Id")
head(combined_sleep_day_data)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/12/2016 12:00:00 AM 1 327
## 3 1503960366 4/12/2016 12:00:00 AM 1 327
## 4 1503960366 4/12/2016 12:00:00 AM 1 327
## 5 1503960366 4/12/2016 12:00:00 AM 1 327
## 6 1503960366 4/12/2016 12:00:00 AM 1 327
## TotalTimeInBed ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 346 5/7/2016 11992 7.71 7.71
## 2 346 5/6/2016 12159 8.03 8.03
## 3 346 5/1/2016 10602 6.81 6.81
## 4 346 4/30/2016 14673 9.25 9.25
## 5 346 4/12/2016 13162 8.50 8.50
## 6 346 4/13/2016 10735 6.97 6.97
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 2.46 2.12
## 2 0 1.97 0.25
## 3 0 2.29 1.60
## 4 0 3.56 1.42
## 5 0 1.88 0.55
## 6 0 1.57 0.69
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 3.13 0 37
## 2 5.81 0 24
## 3 2.92 0 33
## 4 4.27 0 52
## 5 6.06 0 25
## 6 4.71 0 21
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 46 175 833 1821
## 2 6 289 754 1896
## 3 35 246 730 1820
## 4 34 217 712 1947
## 5 13 328 728 1985
## 6 19 217 776 1797
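As we expected, there are only 24 unique IDs in the combined dataset, since only 24 users logged sleep data. Note that merging on Id alone pairs every sleep record with every activity record for the same user (the first rows above pair the 4/12 sleep record with activity from 5/7, 5/6, 5/1, and so on), so the rows in this table are not matched by day.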
n_distinct(combined_sleep_day_data$Id)
## [1] 24
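If day-matched rows are wanted, one option is to parse both date columns and merge on Id plus the date. This is a sketch on toy rows copied from the head() output above, not a rerun of the analysis. The `Date` column and the `sleep_toy`, `activity_toy`, `by_id_only`, and `by_id_date` names are introduced only for this example.

```r
# Toy rows copied from the head() output above for user 1503960366.
sleep_toy <- data.frame(
  Id = rep(1503960366, 3),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/15/2016 12:00:00 AM"),
  TotalTimeInBed = c(346, 407, 442)
)
activity_toy <- data.frame(
  Id = rep(1503960366, 4),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016"),
  SedentaryMinutes = c(728, 776, 1218, 726)
)

# Parse both columns to Date; the trailing time of day on SleepDay is ignored.
sleep_toy$Date    <- as.Date(sleep_toy$SleepDay, format = "%m/%d/%Y")
activity_toy$Date <- as.Date(activity_toy$ActivityDate, format = "%m/%d/%Y")

# Merging on Id alone pairs every sleep row with every activity row.
by_id_only <- merge(sleep_toy, activity_toy, by = "Id")

# Merging on Id and Date keeps one row per user-day present in both tables.
by_id_date <- merge(sleep_toy, activity_toy, by = c("Id", "Date"))

print(nrow(by_id_only))
print(nrow(by_id_date))
```

On the toy rows, the Id-only merge produces every sleep-by-activity combination, while the Id-plus-date merge keeps just the days that appear in both tables.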
We could perform an outer join to include all of the Fitbit users in the dataset; for users without sleep records, the sleep columns would be empty (NA). Let’s try it!
combined_sleep_day_data2 <- merge(sleep_day, daily_activity, by="Id", all = TRUE)
head(combined_sleep_day_data2)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/12/2016 12:00:00 AM 1 327
## 3 1503960366 4/12/2016 12:00:00 AM 1 327
## 4 1503960366 4/12/2016 12:00:00 AM 1 327
## 5 1503960366 4/12/2016 12:00:00 AM 1 327
## 6 1503960366 4/12/2016 12:00:00 AM 1 327
## TotalTimeInBed ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 346 5/7/2016 11992 7.71 7.71
## 2 346 5/6/2016 12159 8.03 8.03
## 3 346 5/1/2016 10602 6.81 6.81
## 4 346 4/30/2016 14673 9.25 9.25
## 5 346 4/12/2016 13162 8.50 8.50
## 6 346 4/13/2016 10735 6.97 6.97
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 2.46 2.12
## 2 0 1.97 0.25
## 3 0 2.29 1.60
## 4 0 3.56 1.42
## 5 0 1.88 0.55
## 6 0 1.57 0.69
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 3.13 0 37
## 2 5.81 0 24
## 3 2.92 0 33
## 4 4.27 0 52
## 5 6.06 0 25
## 6 4.71 0 21
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 46 175 833 1821
## 2 6 289 754 1896
## 3 35 246 730 1820
## 4 34 217 712 1947
## 5 13 328 728 1985
## 6 19 217 776 1797
n_distinct(combined_sleep_day_data2$Id)
## [1] 33
Excellent! As we can see, all 33 IDs are present. Let’s look at sedentary time against time in bed!
For this first pass we’ll try it out with only the 24 unique IDs that have actually logged sleep data.
Let’s fit a linear regression to see how sedentary minutes change with time in bed:
sedentary.lm <- lm(SedentaryMinutes ~ TotalTimeInBed, data = combined_sleep_day_data)
sedentary.lm
##
## Call:
## lm(formula = SedentaryMinutes ~ TotalTimeInBed, data = combined_sleep_day_data)
##
## Coefficients:
## (Intercept) TotalTimeInBed
## 921.9598 -0.2678
And now a Pearson correlation coefficient:
cor(combined_sleep_day_data$TotalTimeInBed,combined_sleep_day_data$SedentaryMinutes, method = "pearson")
## [1] -0.128011
It looks like these two things are not related much at all, which is an interesting finding. As time in bed goes up, sedentary minutes go down slightly, but with r ≈ -0.13 the relationship is weak. We haven’t run a significance test here, so we can’t say anything about statistical significance.
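If we did want a p-value and confidence interval alongside the coefficient, base R’s cor.test() provides them. The sketch below runs on simulated stand-in data, not the Fitbit data, so its numbers say nothing about our users; the variable names and the simulation parameters are assumptions chosen only to give it something to chew on. On the real data, a test run on the Id-only merge would likely overstate the evidence, since each sleep record is repeated many times.

```r
# Simulated stand-in data (NOT the Fitbit data): a weak negative trend plus noise.
set.seed(42)
time_in_bed <- rnorm(200, mean = 460, sd = 120)
sedentary   <- 920 - 0.27 * time_in_bed + rnorm(200, sd = 250)

# cor.test() returns the estimate together with a p-value and confidence interval.
result <- cor.test(time_in_bed, sedentary, method = "pearson")

print(result$estimate)   # Pearson's r
print(result$p.value)    # p-value for the null hypothesis r = 0
print(result$conf.int)   # 95% confidence interval for r
```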
Before I go, I’ve made a few extra graphs for funsies:
There’s a positive relationship between very active minutes and calories burned: the fitted slope is about 13.55 calories per very active minute.
ggplot(data = combined_sleep_day_data, aes(x=VeryActiveMinutes, y=Calories)) + geom_point() + stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
lm(Calories ~ VeryActiveMinutes, data = combined_sleep_day_data)
##
## Call:
## lm(formula = Calories ~ VeryActiveMinutes, data = combined_sleep_day_data)
##
## Coefficients:
## (Intercept) VeryActiveMinutes
## 2004.36 13.55
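The slope for total steps looks very small, about 0.076 calories per step, but steps run into the thousands per day, so a small slope is not evidence of a weak relationship. Slopes measured on different scales can’t be compared directly.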
ggplot(data = combined_sleep_day_data, aes(x=TotalSteps, y=Calories)) + geom_point() +stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
lm(Calories ~ TotalSteps, data = combined_sleep_day_data)
##
## Call:
## lm(formula = Calories ~ TotalSteps, data = combined_sleep_day_data)
##
## Coefficients:
## (Intercept) TotalSteps
## 1.711e+03 7.616e-02
Fairly active minutes show a positive slope as well, about 6.76 calories per minute.
ggplot(data = combined_sleep_day_data, aes(x=FairlyActiveMinutes, y=Calories)) + geom_point() + stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
lm(Calories ~ FairlyActiveMinutes, data = combined_sleep_day_data)
##
## Call:
## lm(formula = Calories ~ FairlyActiveMinutes, data = combined_sleep_day_data)
##
## Coefficients:
## (Intercept) FairlyActiveMinutes
## 2211.85 6.76
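Because the three predictors are on different scales, it helps to plug in a realistic value for each and compare the calories each fitted line attributes to it. The coefficients below are copied from the three lm() printouts above. The example inputs (30 minutes, 10,000 steps) are illustrative choices of mine, not values from the analysis.

```r
# Intercepts and slopes copied from the three lm() printouts above.
fits <- data.frame(
  predictor = c("VeryActiveMinutes", "TotalSteps", "FairlyActiveMinutes"),
  intercept = c(2004.36, 1711, 2211.85),
  slope     = c(13.55, 0.07616, 6.76),
  example_x = c(30, 10000, 30)   # illustrative inputs chosen for this sketch
)

# Calories contributed by the predictor at the example input, and the total prediction.
fits$from_predictor <- fits$slope * fits$example_x
fits$predicted      <- fits$intercept + fits$from_predictor

print(fits)
```

At these inputs the steps model attributes about 762 calories to 10,000 steps, compared with about 407 for 30 very active minutes and about 203 for 30 fairly active minutes.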
We looked at this dataset of Fitbit users pretty intensively to get an idea of what features are being used and how we can market our products.
This Fitbit dataset contains no hydration data, so hydration tracking is a point where Bellabeat can stand apart from the competition!
We showed that all users have daily activity data (calories, steps, etc.), fewer users log their sleep, and only a select few log their weight.
To market this, we initially thought that simply being active and taking steps would help people start burning calories. The plot of steps against calories does trend upward, but the spread is wide and we never measured the strength of that relationship on day-matched data, so we should be careful about marketing it that way.
I would focus on the fact that simply collecting data will help you set goals. Take a strategy from a platform like Noom, and make it our own here at Bellabeat!