Analysis of one of Bellabeat’s products to gain insights into how the consumers are using their smart devices, and thus, provide business solutions and marketing strategies for the company to increase sales and reveal more growth opportunities.
Analyzing smart device usage data to spot trends on how the consumers are using the product, and then use the trends to influence Bellabeat marketing strategy and come up with business recommendations.
Stakeholders:
There are primary and secondary stakeholders for this project:
Primary:
Secondary:
The Fitbit Fitness Tracker Data is a publicly available dataset on Kaggle. It was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016 (month.day.year, i.e. 12 March to 12 May 2016). It contains personal fitness tracker data from 30 eligible Fitbit users who consented to the submission of their data. The dataset consists of 18 CSV files covering physical activity, heart rate, and sleep monitoring, recorded at granularities ranging from seconds (heart rate) to minutes, hours, and days.
The data is assessed against the ROCCC criteria (Reliable, Original, Comprehensive, Current, Cited) to judge its credibility.
Reliable: The data is not reliable because it covers only 30 users, which is too small a sample to represent the whole population. The results will be biased to some extent.
Original: The data was collected through an Amazon Mechanical Turk survey, so it is second- or third-party data rather than original data.
Comprehensive: Important information about the users, such as age and gender, is missing, which makes the data less comprehensive and the conclusions less accurate.
Current: The data is from 2016, so recommendations based on it may no longer reflect how users behave today.
Cited: The data is not cited. Only the name of the survey appears, which makes it difficult to assume that the data is credible.
The data is stored in CSV files. Processing them in a spreadsheet application would be cumbersome, so SQL or R is a better fit for this analysis.
The analysis is done in R. The first step is to load the data into the R environment and take a first look.
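As a rough illustration of what "a first look" can mean, the sketch below writes a tiny made-up CSV to a temporary file, reads it back, and inspects its shape and column types. The file contents are invented for the example, and base R's read.csv is used instead of readr only to keep the sketch free of extra packages.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Id,ActivityDate,TotalSteps,Calories",
  "1,4/12/2016,13162,1985",
  "1,4/13/2016,10735,1797",
  "2,4/12/2016,0,1347"
), csv_path)

# Read it back as a data frame.
activity <- read.csv(csv_path, stringsAsFactors = FALSE)

dim(activity)      # number of rows and columns
str(activity)      # column names and types
head(activity, 2)  # first two rows
```

The same three calls work unchanged on the data frames loaded from the real files.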
library(readr)

# Read all 18 CSV files of the dataset
daily_activity <- read_csv("C:/Users/herot/Desktop/Fitabase Data/dailyActivity_merged.csv")
daily_calories <- read_csv("C:/Users/herot/Desktop/Fitabase Data/dailyCalories_merged.csv")
daily_intensities <- read_csv("C:/Users/herot/Desktop/Fitabase Data/dailyIntensities_merged.csv")
daily_steps <- read_csv("C:/Users/herot/Desktop/Fitabase Data/dailySteps_merged.csv")
daily_sleep <- read_csv("C:/Users/herot/Desktop/Fitabase Data/sleepDay_merged.csv")
heartrate_seconds <- read_csv("C:/Users/herot/Desktop/Fitabase Data/heartrate_seconds_merged.csv")
hourly_calories <- read_csv("C:/Users/herot/Desktop/Fitabase Data/hourlyCalories_merged.csv")
hourly_intensities <- read_csv("C:/Users/herot/Desktop/Fitabase Data/hourlyIntensities_merged.csv")
hourly_steps <- read_csv("C:/Users/herot/Desktop/Fitabase Data/hourlySteps_merged.csv")
minute_calories_narrow <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteCaloriesNarrow_merged.csv")
minute_calories_wide <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteCaloriesWide_merged.csv")
minute_intensities_narrow <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteIntensitiesNarrow_merged.csv")
minute_intensities_wide <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteIntensitiesWide_merged.csv")
minute_mets_narrow <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteMETsNarrow_merged.csv")
minute_sleep <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteSleep_merged.csv")
minute_steps_narrow <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteStepsNarrow_merged.csv")
minute_steps_wide <- read_csv("C:/Users/herot/Desktop/Fitabase Data/minuteStepsWide_merged.csv")
weight_log <- read_csv("C:/Users/herot/Desktop/Fitabase Data/weightLogInfo_merged.csv")

After a first look at the data, only 4 files seem relevant to the business task: daily_activity, which combines the daily variables into one file, heartrate_seconds, daily_sleep, and weight_log. The remaining data frames are removed.

# Drop the data frames that are not needed
rm(daily_calories, daily_intensities, daily_steps,
   hourly_calories, hourly_intensities, hourly_steps,
   minute_calories_narrow, minute_calories_wide,
   minute_intensities_narrow, minute_intensities_wide,
   minute_mets_narrow, minute_sleep,
   minute_steps_narrow, minute_steps_wide)

library(tidyverse)
library(lubridate)
library(hrbrthemes)
library(corrplot)
library(ggcorrplot)
library(viridis)

# Split the date-time columns into separate Date and Time columns
weight_log <- weight_log %>%
  separate(Date, c("Date", "Time"), " ")
heartrate_seconds <- heartrate_seconds %>%
  separate(Time, c("Date", "Time"), " ")
daily_sleep <- daily_sleep %>%
  separate(SleepDay, c("Date", "Time"), " ")
daily_sleep <- subset(daily_sleep, select = -Time)

For the daily sleep data, the Time column is irrelevant because the data is daily, so it can be dropped from the table without affecting the analysis.
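To make the date/time split concrete, here is a minimal base R sketch on two made-up stamps. The stamp format, including the AM/PM suffix, is an assumption for illustration. Splitting on the first space only keeps the suffix with the time, whereas tidyr's separate() with two target names would by default warn and discard a third piece.

```r
# Made-up stamps in an assumed "month/day/year hour:minute:second AM/PM" layout.
stamps <- c("4/12/2016 12:00:00 AM", "4/13/2016 7:21:05 PM")

# Everything before the first space is the date.
date_part <- sub(" .*$", "", stamps)

# Everything after the first space is the time, including the AM/PM suffix.
time_part <- sub("^[^ ]+ ", "", stamps)

data.frame(Date = date_part, Time = time_part)
```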
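# Mean heart rate per participant and day
heartrate_daily <-
  tibble(heartrate_seconds %>%
    group_by(Date, Id) %>%
    summarise(Mean_Heartrate = mean(Value)))

# Reload the raw heart rate data to keep the full timestamp
heartrate_time <- read_csv("C:/Users/herot/Desktop/Fitabase Data/heartrate_seconds_merged.csv")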
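# dmy_hms() reads the stamps as day/month/year; any stamp that does not parse in that order becomes NA
heartrate_time$time <- dmy_hms(heartrate_time$Time)
heartrate_time <- na.omit(heartrate_time) # remove rows with missing values, including unparsed stamps
# Hour boundaries of the time-of-day bins: 6, 12, 16, 19, 23
breaks <- hour(hm("6:00", "12:00", "16:00", "19:00", "23:59"))
labels <- c("Morning", "Afternoon", "Evening", "Night")
# Hours before 6 fall into no bin and become NA
heartrate_time$Time_of_day <- cut(x=hour(heartrate_time$time), breaks = breaks, labels = labels, include.lowest = TRUE)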
heartrate_time <- heartrate_time %>% drop_na()

# Mean heart rate per time of day
heartrate_grouped <-
  tibble(heartrate_time %>%
    group_by(Time_of_day) %>%
    summarise(heartrate_mean = mean(Value)))
heartrate_grouped <- heartrate_grouped %>% drop_na()

| Time_of_day | heartrate_mean |
|---|---|
| Morning | 78.31348 |
| Afternoon | 81.17850 |
| Evening | 84.13020 |
| Night | 76.59507 |
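The binning behind the table above can be checked on a handful of hours. The sketch below uses the numeric boundaries 6, 12, 16, 19 and 23 that the lubridate expression in the analysis evaluates to; the sample hours are made up. It shows that each bin is closed on the right, that hour 6 is included in "Morning" only because of include.lowest, and that hours before 6 receive no label at all, so "Night" here covers hours 20 to 23 only.

```r
# Hour boundaries and labels as used in the analysis.
breaks <- c(6, 12, 16, 19, 23)
labels <- c("Morning", "Afternoon", "Evening", "Night")

# A few made-up hours of the day, including one before the first boundary.
hours <- c(3, 6, 12, 13, 16, 17, 19, 20, 23)

# Bins are [6,12], (12,16], (16,19], (19,23]; anything outside becomes NA.
time_of_day <- cut(hours, breaks = breaks, labels = labels, include.lowest = TRUE)

data.frame(hour = hours, time_of_day = time_of_day)
```

Readings taken between midnight and 05:59 are therefore dropped by the later drop_na() call rather than counted as night-time values.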
nrow(daily_activity[duplicated(daily_activity),])
## [1] 0
nrow(heartrate_daily[duplicated(heartrate_daily),])
## [1] 0
nrow(daily_sleep[duplicated(daily_sleep),])
## [1] 3
nrow(weight_log[duplicated(weight_log),])
## [1] 0
The sleep dataset has 3 duplicate rows, which should be removed to avoid skewed metrics and therefore wrong conclusions.
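A small made-up example shows how a duplicated row can shift a summary statistic. The values are invented, and base R's duplicated() and unique() stand in for the dplyr call to keep the sketch dependency-free.

```r
# Made-up sleep records; the second row repeats the first.
sleep <- data.frame(
  Id = c(1, 1, 1, 2),
  Date = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 384, 412)
)

# Count fully duplicated rows, then keep one copy of each row.
n_duplicates <- sum(duplicated(sleep))
sleep_clean <- unique(sleep)

# The repeated row pulls the mean toward its own value.
mean_with_duplicates <- mean(sleep$TotalMinutesAsleep)
mean_without_duplicates <- mean(sleep_clean$TotalMinutesAsleep)
```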
daily_sleep <- dplyr::distinct(daily_sleep)

which(is.na(daily_activity))
## integer(0)
which(is.na(heartrate_daily))
## integer(0)
which(is.na(daily_sleep))
## integer(0)
which(is.na(weight_log))
## [1] 337 338 339 340 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356
## [20] 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375
## [39] 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394
## [58] 395 396 397 398 399 400 401 402
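The indices printed by which(is.na(weight_log)) are positions counted down the columns of the table, which is why one contiguous run of numbers can point to a single column. The sketch below, on a made-up three-column table, shows the same kind of linear indices next to their row and column positions; arr.ind = TRUE is a standard argument of which().

```r
# Made-up weight log with missing values in one column only.
weights <- data.frame(
  Id = c(1, 1, 2),
  WeightKg = c(52.6, 52.9, 85.1),
  Fat = c(22, NA, NA)
)

# Linear positions, counted column by column.
linear_idx <- which(is.na(weights))

# The same cells as row/column pairs.
positions <- which(is.na(weights), arr.ind = TRUE)

# Names of the columns that contain at least one missing value.
na_columns <- names(weights)[unique(positions[, "col"])]
```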
Null values are present only in the weight dataset. The next step identifies the column that contains them and removes it.
colnames(weight_log)[colSums(is.na(weight_log)) > 0]
## [1] "Fat"
weight_log <- select(weight_log, -Fat)

# Join the datasets on participant and date (merge() keeps only matching rows by default)
unique_dataframe <- merge(daily_activity, daily_sleep, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
unique_dataframe <- merge(unique_dataframe, select(weight_log, -Time), by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
unique_dataframe <- merge(unique_dataframe, heartrate_daily, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))

length(unique(daily_activity$Id))
## [1] 33
length(unique(heartrate_daily$Id))
## [1] 14
length(unique(daily_sleep$Id))
## [1] 24
length(unique(weight_log$Id))
## [1] 8
The activity dataset covers the most participants (33), whereas only 3 of those participants appear in all four datasets.
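One way to count the participants present in every dataset is to intersect the ID vectors. The sketch below uses made-up IDs in place of the real Id columns, so the resulting count is illustrative only.

```r
# Made-up participant IDs standing in for the Id column of each dataset.
ids_activity  <- c(101, 102, 103, 104, 105)
ids_sleep     <- c(101, 102, 104)
ids_weight    <- c(102, 104, 105)
ids_heartrate <- c(102, 103, 104)

# Intersect the vectors pairwise, left to right, to keep IDs found in all of them.
common_ids <- Reduce(intersect, list(ids_activity, ids_sleep, ids_weight, ids_heartrate))

length(common_ids)
```

With the real data, the four vectors would be unique(daily_activity$Id), unique(daily_sleep$Id), unique(weight_log$Id) and unique(heartrate_daily$Id).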
daily_activity %>%
select(TotalSteps,
TotalDistance,
TrackerDistance,
VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes,
SedentaryMinutes,
Calories) %>%
summary()
## TotalSteps TotalDistance TrackerDistance VeryActiveMinutes
## Min. : 0 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 2.620 1st Qu.: 0.00
## Median : 7406 Median : 5.245 Median : 5.245 Median : 4.00
## Mean : 7638 Mean : 5.490 Mean : 5.475 Mean : 21.16
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.: 7.710 3rd Qu.: 32.00
## Max. :36019 Max. :28.030 Max. :28.030 Max. :210.00
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8 1st Qu.:1828
## Median : 6.00 Median :199.0 Median :1057.5 Median :2134
## Mean : 13.56 Mean :192.8 Mean : 991.2 Mean :2304
## 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :143.00 Max. :518.0 Max. :1440.0 Max. :4900
Observations:
daily_sleep %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
Observations:
summary(heartrate_daily$Mean_Heartrate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.38 70.47 77.49 78.61 84.93 109.79
Observations:
heartrate_time %>%
select(Value,
Time_of_day) %>%
summary()
## Value Time_of_day
## Min. : 38.0 Morning :329894
## 1st Qu.: 66.0 Afternoon:208150
## Median : 77.0 Evening :150108
## Mean : 79.8 Night :139066
## 3rd Qu.: 90.0
## Max. :199.0
heartrate_grouped
## # A tibble: 4 x 2
## Time_of_day heartrate_mean
## <fct> <dbl>
## 1 Morning 78.3
## 2 Afternoon 81.2
## 3 Evening 84.1
## 4 Night 76.6
Observations:
weight_log %>%
select(WeightKg,
BMI) %>%
summary()
## WeightKg BMI
## Min. : 52.60 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:23.96
## Median : 62.50 Median :24.39
## Mean : 72.04 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:25.56
## Max. :133.50 Max. :47.54
Observations:
viz1 <- ggplot(data=daily_activity, aes(x=TotalSteps, y=Calories))+
geom_point(colour="yellow")+
geom_smooth(method=lm, colour="white")+
labs(title = "Total Steps VS Calories", x="Total Steps", y="Calories")+
scale_x_comma()+
theme_ft_rc()
plot(viz1)
## `geom_smooth()` using formula 'y ~ x'
The plot shows a positive correlation between total steps and calories burnt. However, some outliers do not follow this trend.
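The visual impression of a trend with outliers can be backed by a linear fit and its residuals. The sketch below runs on simulated data, not on the Fitbit records, and the cut-off of two residual standard deviations for calling a point an outlier is an arbitrary choice made for illustration.

```r
# Simulated steps and calories with a positive linear relationship plus noise.
set.seed(42)
steps <- round(runif(50, min = 0, max = 20000))
calories <- 1700 + 0.08 * steps + rnorm(50, sd = 300)

# Pearson correlation between the two variables.
r <- cor(steps, calories)

# Straight-line fit, the same model geom_smooth(method = lm) draws.
fit <- lm(calories ~ steps)

# Flag points whose residual exceeds two residual standard deviations.
resid_sd <- sd(residuals(fit))
outliers <- which(abs(residuals(fit)) > 2 * resid_sd)
```

On the real data, steps and calories would be daily_activity$TotalSteps and daily_activity$Calories.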
viz2 <- ggplot(data=daily_activity, aes(x=TrackerDistance, y=TotalDistance))+
geom_point(colour="yellow")+
geom_smooth(method=lm, colour="white")+
labs(title = "Total Distance VS Tracker Distance", x="Tracker Distance", y="Total Distance")+
scale_x_comma()+
theme_ft_rc()
plot(viz2)
## `geom_smooth()` using formula 'y ~ x'
The plot shows that the tracker distance and the total distance are almost identical, which means the Fitbit tracker records nearly all of the distance covered by the users. In some cases the total distance is greater than the tracker distance. This could be due to human error, for example users forgetting to wear the device for some time.
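The size of the gap between the two distance columns can also be quantified directly. The sketch below uses four made-up rows; the column names follow daily_activity, and the tolerance of 1e-9 is an arbitrary guard against floating-point noise.

```r
# Made-up distances; only the third row differs between the two columns.
activity <- data.frame(
  TotalDistance   = c(8.50, 6.97, 5.20, 9.88),
  TrackerDistance = c(8.50, 6.97, 4.90, 9.88)
)

# Distance that is in the total but was not recorded by the tracker.
gap <- activity$TotalDistance - activity$TrackerDistance

# Share of days on which both columns agree, and the rows where they do not.
share_identical <- mean(abs(gap) < 1e-9)
rows_with_gap <- which(gap > 1e-9)
```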
viz3 <- ggplot(data=daily_activity, aes(x=VeryActiveMinutes, y=Calories))+
geom_point(colour="yellow")+
geom_smooth(colour="white")+
labs(title="Very Active Minutes VS Calories", x= "Very Active Minutes")+
theme_ft_rc()
viz4 <- ggplot(data=daily_activity, aes(x=FairlyActiveMinutes, y=Calories))+
geom_point(colour="yellow")+
geom_smooth(colour="white")+
labs(title="Fairly Active Minutes VS Calories", x= "Fairly Active Minutes")+
theme_ft_rc()
viz5 <- ggplot(data=daily_activity, aes(x=LightlyActiveMinutes, y=Calories))+
geom_point(colour="yellow")+
geom_smooth(colour="white")+
labs(title="Lightly Active Minutes VS Calories", x= "Lightly Active Minutes")+
theme_ft_rc()
plot(viz3)
plot(viz4)
plot(viz5)

The three plots show that very active minutes and lightly active minutes are positively correlated with calories burnt, whereas fairly active minutes appear negatively correlated with them. For very active and fairly active minutes, most observations cluster near 0 minutes, while lightly active minutes are spread over a wider range.
viz6 <- ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed))+
geom_point(colour="yellow")+
geom_smooth(method=lm, colour="white")+
labs(title="Total Minutes in Bed VS Total Minutes Asleep", x="Total Minutes Asleep", y="Total Minutes in Bed")+
theme_ft_rc()
plot(viz6)
## `geom_smooth()` using formula 'y ~ x'
The plot shows that total minutes asleep and total time in bed track each other closely, except for some outliers.
viz7 <- ggplot(data=unique_dataframe, aes(x=VeryActiveMinutes, y=TotalMinutesAsleep))+
geom_point(colour="yellow")+
geom_smooth(colour="white")+
labs(title="Total Minutes Asleep VS Very Active Minutes", x= "Very Active Minutes", y="Total Minutes Asleep")+
theme_ft_rc()
viz8 <- ggplot(data=unique_dataframe, aes(x=FairlyActiveMinutes, y=TotalMinutesAsleep))+
geom_point(colour="yellow")+
geom_smooth(colour="white")+
labs(title="Total Minutes Asleep VS Fairly Active Minutes", x= "Fairly Active Minutes", y="Total Minutes Asleep")+
theme_ft_rc()
viz9 <- ggplot(data=unique_dataframe, aes(x=LightlyActiveMinutes, y=TotalMinutesAsleep))+
geom_point(colour="yellow")+
geom_smooth(colour="white")+
labs(title="Total Minutes Asleep VS Lightly Active Minutes", x= "Lightly Active Minutes", y="Total Minutes Asleep")+
theme_ft_rc()
plot(viz7)
plot(viz8)
plot(viz9)

The three plots show that very active minutes and fairly active minutes are positively correlated with total minutes asleep, whereas lightly active minutes are negatively correlated with it.
variables_matrix <- select(daily_activity, TotalSteps, TotalDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, Calories)
correlation_matrix <- cor(variables_matrix)
ggcorrplot(correlation_matrix, lab = TRUE)+
scale_fill_gradient2(low = "red", high = "darkslateblue", mid = "white")

The correlation matrix shows that total steps, total distance, and very active minutes, with correlation coefficients of 0.59, 0.64, and 0.62 respectively, are moderately to strongly correlated with calories burnt.
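To read such a matrix without the plot, the column for one variable can be pulled out and sorted. The sketch below builds a small simulated table, so its coefficients are not those of the Fitbit data, and uses base R's cor(), which computes Pearson correlations by default.

```r
# Simulated variables with built-in positive relationships.
set.seed(1)
n <- 100
steps <- runif(n, min = 0, max = 20000)
distance <- steps * 0.0007 + rnorm(n, sd = 0.5)
calories <- 1800 + 0.07 * steps + rnorm(n, sd = 400)

vars <- data.frame(TotalSteps = steps, TotalDistance = distance, Calories = calories)

# Pairwise Pearson correlations between all columns.
correlation_matrix <- cor(vars)

# Correlation of every variable with Calories, strongest first.
sort(correlation_matrix[, "Calories"], decreasing = TRUE)
```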
The difference between total time in bed and total time asleep is small, about 40 minutes on average. The average total time asleep is about 7 hours, which suggests that most participants get sufficient sleep, which is good for their health.
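These figures follow from the two means in the sleep summary. The short calculation below redoes the arithmetic; the ratio of time asleep to time in bed is an extra derived figure that is not part of the original analysis.

```r
# Means taken from the summary of daily_sleep.
mean_minutes_asleep <- 419.2
mean_minutes_in_bed <- 458.5

# Average time spent in bed without sleeping, in minutes.
awake_in_bed <- mean_minutes_in_bed - mean_minutes_asleep

# Average sleep duration in hours.
hours_asleep <- mean_minutes_asleep / 60

# Share of the time in bed that is spent asleep (derived, for illustration).
asleep_share <- mean_minutes_asleep / mean_minutes_in_bed
```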
The average body mass index recorded is 25.19, which falls just inside the overweight range of 25 to 29.9. However, the median is 24.39, which is in the normal range, and only 8 participants logged their weight, so this is weak evidence about the participants' diet or exercise habits.
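The BMI figures from the weight summary can be placed into categories with cut(). The sketch below assumes the commonly used adult cut-offs of 18.5, 25 and 30; the six values are the summary statistics reported for weight_log, not individual records.

```r
# Summary statistics of BMI as reported for weight_log.
bmi_summary <- c(Min = 21.45, Q1 = 23.96, Median = 24.39,
                 Mean = 25.19, Q3 = 25.56, Max = 47.54)

# Assumed cut-offs: below 18.5, 18.5 to under 25, 25 to under 30, 30 and above.
bmi_class <- cut(
  bmi_summary,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right = FALSE
)

data.frame(statistic = names(bmi_summary), bmi = unname(bmi_summary), class = bmi_class)
```

The mean lands in the overweight class while the median stays in the normal class, which is what a few very high values, such as the maximum of 47.54, would produce.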
The tracker distance and the total distance are almost identical, which suggests that the Fitbit tracker measures distance accurately and that distance tracking does not appear to need improvement.
The sample size is small, and conclusions drawn from it might not be accurate: only 3 participants appear in all four datasets.
A mobile push notification for reminding the users to be active could be a good idea since there is a correlation between the total steps and the calories burnt.
Adding daily steps goals and achievements to the mobile application to incentivize the users to be active.
The maximum heart rate recorded was 199, which is very high and alarming. An alert system that triggers when the heart rate exceeds a certain threshold would be a good feature to implement as it could save lives.
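As a very rough sketch of such an alert rule, the function below flags readings above a fixed threshold. The function name, the default of 180 beats per minute and the sample readings are all hypothetical; a real feature would need a medically justified, probably user-specific, threshold.

```r
# Hypothetical helper: TRUE for readings above the threshold, FALSE otherwise.
# Missing readings are never flagged.
flag_high_heart_rate <- function(bpm, threshold = 180) {
  !is.na(bpm) & bpm > threshold
}

# Made-up heart rate readings in beats per minute.
readings <- c(72, 95, 199, NA, 181)

alerts <- flag_high_heart_rate(readings)

# Positions of the readings that would trigger an alert.
which(alerts)
```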
The average total time asleep of 7 hours is good. In order to let the users keep the same sleep routine, a sleep time reminder could be added to the application.
Performing other surveys to gather more data, such as the age and the gender, and from a bigger sample size might be important in order to do more targeted improvements. To encourage users to complete all the surveys, rewards should be given to whoever completed all of them. The rewards will be distributed from a budget specifically allocated to this matter.