Bellabeat, founded in 2013, wants to understand how its products stand against the market competition. In this project I am trying to answer the points below:
To answer the stakeholders' questions, we can break this project down into six phases:
====================================================================================================================================
In this phase I identify the key tasks: keeping the stakeholders in mind and asking the right questions for data-driven decision making.
The problem to solve in this project is how consumers use their smart devices, which the given data can reveal. The answer should also point to growth opportunities for the company in the global smart-device market. Based on the findings of this analysis and the high-level recommendations, Bellabeat's marketing strategy can be shaped accordingly.
Using the given data, the business task is to identify potential opportunities for growth and to make recommendations. The key stakeholders, Urška Sršen (co-founder and Chief Creative Officer) and Sando Mur (co-founder and key member of the executive team), will be notified of the key findings of this project.
The key business task in this Ask phase is to ask SMART questions that help to study and understand the project for data-driven decision making.
In this stage I get the required data and prepare it for exploration.
The data is stored at https://www.kaggle.com/arashnic/fitbit (public domain, data set made available through Mobius). All available data is in 18 CSV files. The data is reliable and original, as it is primary data. It is comprehensive, current, and cited. It sits in the public domain and is accessible to anyone at any time. As Bellabeat recommended this data, I treat it as reliable. The data can be used to answer my questions, with one caveat: a device is registered to one customer, but it could be worn by someone else.
All data files are downloaded in CSV format. The data, treated as credible as directed by Bellabeat, is now sorted and filtered for processing. Below is the list of data sources used.
daily_raw <- read.csv("dailyActivity_merged.csv")
sleep_raw <- read.csv("sleepDay_date.csv")
weight_raw <- read.csv("weightLogInfo_merged.csv")
dailyCalories_raw <- read.csv("dailyCalories_merged.csv")
hourlyIntensities_raw <- read.csv("hourlyIntensities_merged.csv")
# install.packages("tidyverse")
library("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# View the data in a table
View(daily_raw)
View(sleep_raw)
View(weight_raw)
View(dailyCalories_raw)
View(hourlyIntensities_raw)
# Check the internal structure of the data
str(daily_raw)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(sleep_raw)
## 'data.frame': 413 obs. of 6 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ SleepdayDate : chr "2016-04-12" "2016-04-13" "2016-04-15" "2016-04-16" ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
str(weight_raw)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
str(dailyCalories_raw)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(hourlyIntensities_raw)
## 'data.frame': 22099 obs. of 4 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour : chr "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ TotalIntensity : int 20 8 7 0 0 0 0 0 13 30 ...
## $ AverageIntensity: num 0.333 0.133 0.117 0 0 ...
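The str() output above shows the date columns stored as character and Id stored as a number that prints in scientific notation. A minimal sketch of how these could be typed at load time follows. It reads a three-row sample copied from the head of dailyActivity_merged.csv; the names csv_text and daily_sample are hypothetical, and treating Id as a character label is my assumption, not something the data source prescribes.

```r
# Three sample rows copied from the daily activity output in this report
csv_text <- "Id,ActivityDate,TotalSteps
1503960366,4/12/2016,13162
1503960366,4/13/2016,10735
1503960366,4/14/2016,10460"

# Read Id as character so it is treated as a label, not a quantity
# (assumption: Id is only ever used for grouping and joining)
daily_sample <- read.csv(text = csv_text, colClasses = c(Id = "character"))

# Convert the month/day/year strings into proper Date values
daily_sample$ActivityDate <- as.Date(daily_sample$ActivityDate, format = "%m/%d/%Y")

str(daily_sample)
```

With real Date values the rows can be sorted, filtered by range, and joined to other tables on the calendar day.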
# head() and tail() show the first and last n rows of each data set
head(daily_raw)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
head(sleep_raw)
## Id SleepDay SleepdayDate TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 2016-04-12 327
## 2 1503960366 4/13/2016 12:00:00 AM 2016-04-13 384
## 3 1503960366 4/15/2016 12:00:00 AM 2016-04-15 412
## 4 1503960366 4/16/2016 12:00:00 AM 2016-04-16 340
## 5 1503960366 4/17/2016 12:00:00 AM 2016-04-17 700
## 6 1503960366 4/19/2016 12:00:00 AM 2016-04-19 304
## TotalSleepRecords TotalTimeInBed
## 1 1 346
## 2 2 407
## 3 1 442
## 4 2 367
## 5 1 712
## 6 1 320
head(weight_raw)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
head(dailyCalories_raw)
## Id ActivityDay Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
head(hourlyIntensities_raw)
## Id ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 4/12/2016 12:00:00 AM 20 0.333333
## 2 1503960366 4/12/2016 1:00:00 AM 8 0.133333
## 3 1503960366 4/12/2016 2:00:00 AM 7 0.116667
## 4 1503960366 4/12/2016 3:00:00 AM 0 0.000000
## 5 1503960366 4/12/2016 4:00:00 AM 0 0.000000
## 6 1503960366 4/12/2016 5:00:00 AM 0 0.000000
tail(daily_raw)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 935 8877689391 5/7/2016 12332 8.13 8.13
## 936 8877689391 5/8/2016 10686 8.11 8.11
## 937 8877689391 5/9/2016 20226 18.25 18.25
## 938 8877689391 5/10/2016 10733 8.15 8.15
## 939 8877689391 5/11/2016 21420 19.56 19.56
## 940 8877689391 5/12/2016 8064 6.12 6.12
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 935 0 0.08 0.96
## 936 0 1.08 0.20
## 937 0 11.10 0.80
## 938 0 1.35 0.46
## 939 0 13.22 0.41
## 940 0 1.82 0.04
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 935 6.99 0.00 105
## 936 6.80 0.00 17
## 937 6.24 0.05 73
## 938 6.28 0.00 18
## 939 5.89 0.00 88
## 940 4.25 0.00 23
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 935 28 271 1036 4142
## 936 4 245 1174 2847
## 937 19 217 1131 3710
## 938 11 224 1187 2832
## 939 12 213 1127 3832
## 940 1 137 770 1849
tail(sleep_raw)
## Id SleepDay SleepdayDate TotalMinutesAsleep
## 408 8792009665 4/29/2016 12:00:00 AM 2016-04-29 398
## 409 8792009665 4/30/2016 12:00:00 AM 2016-04-30 343
## 410 8792009665 5/1/2016 12:00:00 AM 2016-05-01 503
## 411 8792009665 5/2/2016 12:00:00 AM 2016-05-02 415
## 412 8792009665 5/3/2016 12:00:00 AM 2016-05-03 516
## 413 8792009665 5/4/2016 12:00:00 AM 2016-05-04 439
## TotalSleepRecords TotalTimeInBed
## 408 1 406
## 409 1 360
## 410 1 527
## 411 1 423
## 412 1 545
## 413 1 463
tail(weight_raw)
## Id Date WeightKg WeightPounds Fat BMI
## 62 8877689391 5/4/2016 6:48:22 AM 84.4 186.0702 NA 25.26
## 63 8877689391 5/6/2016 6:43:35 AM 85.0 187.3929 NA 25.44
## 64 8877689391 5/8/2016 7:35:53 AM 85.4 188.2748 NA 25.56
## 65 8877689391 5/9/2016 6:39:44 AM 85.5 188.4952 NA 25.61
## 66 8877689391 5/11/2016 6:51:47 AM 85.4 188.2748 NA 25.56
## 67 8877689391 5/12/2016 6:42:53 AM 84.0 185.1883 NA 25.14
## IsManualReport LogId
## 62 False 1.462345e+12
## 63 False 1.462517e+12
## 64 False 1.462693e+12
## 65 False 1.462776e+12
## 66 False 1.462950e+12
## 67 False 1.463035e+12
tail(dailyCalories_raw)
## Id ActivityDay Calories
## 935 8877689391 5/7/2016 4142
## 936 8877689391 5/8/2016 2847
## 937 8877689391 5/9/2016 3710
## 938 8877689391 5/10/2016 2832
## 939 8877689391 5/11/2016 3832
## 940 8877689391 5/12/2016 1849
tail(hourlyIntensities_raw)
## Id ActivityHour TotalIntensity AverageIntensity
## 22094 8877689391 5/12/2016 9:00:00 AM 4 0.066667
## 22095 8877689391 5/12/2016 10:00:00 AM 12 0.200000
## 22096 8877689391 5/12/2016 11:00:00 AM 29 0.483333
## 22097 8877689391 5/12/2016 12:00:00 PM 93 1.550000
## 22098 8877689391 5/12/2016 1:00:00 PM 6 0.100000
## 22099 8877689391 5/12/2016 2:00:00 PM 9 0.150000
# Selecting key variables from data set (dailyActivity_merged.csv)
dailyActivity <- daily_raw %>%
  select(Id, ActivityDate, TotalSteps, TotalDistance, SedentaryMinutes, Calories) %>%
  drop_na()
n_distinct(dailyActivity)
## [1] 940
# Selecting key variables from data set (sleepDay_merged.csv)
sleepDay <- sleep_raw %>%
  select(Id, TotalMinutesAsleep, TotalTimeInBed)
n_distinct(sleepDay)
## [1] 410
# Selecting key variables from data set (weightLogInfo_merged.csv)
weightLogInfo <- weight_raw %>%
  select(Id, WeightKg, BMI)
n_distinct(weightLogInfo)
## [1] 36
# Selecting key variables from data set (dailyCalories_merged.csv)
dailyCalories <- dailyCalories_raw %>%
  select(Id, Calories)
n_distinct(dailyCalories)
## [1] 874
# Selecting key variables from data set (hourlyIntensities_merged.csv)
hourlyIntensities <- hourlyIntensities_raw %>%
  select(Id, ActivityHour)
n_distinct(hourlyIntensities)
## [1] 22099
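Judging from the counts above (for example 410 against 413 sleep rows), n_distinct() applied to a whole data frame appears to count distinct rows over the selected columns, not distinct participants. The sketch below separates the three numbers that are easy to mix up, using base R on a toy frame. The frame sleep_toy and its values are made up for illustration.

```r
# Toy rows in the shape of sleepDay (values are illustrative, not from the data set)
sleep_toy <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  TotalMinutesAsleep = c(327, 384, 384, 412, 340),
  TotalTimeInBed = c(346, 407, 407, 442, 367)
)

total_rows    <- nrow(sleep_toy)                # every record
distinct_rows <- nrow(unique(sleep_toy))        # records after dropping exact repeats
participants  <- length(unique(sleep_toy$Id))   # distinct Ids
duplicates    <- sum(duplicated(sleep_toy))     # rows that repeat an earlier row

c(total_rows = total_rows, distinct_rows = distinct_rows,
  participants = participants, duplicates = duplicates)
```

On the real tables, the participant count would come from the Id column alone, for example length(unique(sleep_raw$Id)).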
daily_raw %>%
  select(TotalSteps, TotalDistance, SedentaryMinutes, Calories) %>%
  summary()
## TotalSteps TotalDistance SedentaryMinutes Calories
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.:1828
## Median : 7406 Median : 5.245 Median :1057.5 Median :2134
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean :2304
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :4900
daily_raw %>%
  select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes) %>%
  summary()
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
## Median : 4.00 Median : 6.00 Median :199.0
## Mean : 21.16 Mean : 13.56 Mean :192.8
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
## Max. :210.00 Max. :143.00 Max. :518.0
sleep_raw %>%
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
weight_raw %>%
  select(WeightKg, BMI) %>%
  summary()
## WeightKg BMI
## Min. : 52.60 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:23.96
## Median : 62.50 Median :24.39
## Mean : 72.04 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:25.56
## Max. :133.50 Max. :47.54
hourlyIntensities_raw %>%
  select(Id, ActivityHour, TotalIntensity) %>%
  drop_na() %>%
  group_by(Id, ActivityHour)
## # A tibble: 22,099 x 3
## # Groups: Id, ActivityHour [22,099]
## Id ActivityHour TotalIntensity
## <dbl> <chr> <int>
## 1 1503960366 4/12/2016 12:00:00 AM 20
## 2 1503960366 4/12/2016 1:00:00 AM 8
## 3 1503960366 4/12/2016 2:00:00 AM 7
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
## 7 1503960366 4/12/2016 6:00:00 AM 0
## 8 1503960366 4/12/2016 7:00:00 AM 0
## 9 1503960366 4/12/2016 8:00:00 AM 13
## 10 1503960366 4/12/2016 9:00:00 AM 30
## # ... with 22,089 more rows
sleep_activity_merged <- merge(sleep_raw, dailyActivity, by=c('Id'))
str(sleep_activity_merged)
## 'data.frame': 12441 obs. of 11 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/12/2016 12:00:00 AM" "4/12/2016 12:00:00 AM" "4/12/2016 12:00:00 AM" ...
## $ SleepdayDate : chr "2016-04-12" "2016-04-12" "2016-04-12" "2016-04-12" ...
## $ TotalMinutesAsleep: int 327 327 327 327 327 327 327 327 327 327 ...
## $ TotalSleepRecords : int 1 1 1 1 1 1 1 1 1 1 ...
## $ TotalTimeInBed : int 346 346 346 346 346 346 346 346 346 346 ...
## $ ActivityDate : chr "5/7/2016" "5/6/2016" "5/1/2016" "4/30/2016" ...
## $ TotalSteps : int 11992 12159 10602 14673 13162 10735 15355 14070 13154 11181 ...
## $ TotalDistance : num 7.71 8.03 6.81 9.25 8.5 ...
## $ SedentaryMinutes : int 833 754 730 712 728 776 814 857 782 815 ...
## $ Calories : int 1821 1896 1820 1947 1985 1797 2013 1959 1898 1837 ...
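Merging on Id alone pairs every sleep record of a participant with every activity record of the same participant, which is why the merged frame above has 12441 rows although the inputs have 413 and 940. Plots built on it repeat each activity row many times. A minimal sketch of the alternative, joining on both Id and calendar date, is shown below on toy data. The frames sleep_toy and activity_toy and the shared Date column are hypothetical names introduced for illustration; only the two date formats follow the str() output.

```r
# Toy frames in the shape of sleep_raw and dailyActivity (values are illustrative)
sleep_toy <- data.frame(
  Id = c("A", "A", "B"),
  SleepdayDate = c("2016-04-12", "2016-04-13", "2016-04-12"),
  TotalMinutesAsleep = c(327, 384, 412)
)
activity_toy <- data.frame(
  Id = c("A", "A", "B"),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 10460)
)

# Bring both date columns to a common Date type under a common name
sleep_toy$Date    <- as.Date(sleep_toy$SleepdayDate)
activity_toy$Date <- as.Date(activity_toy$ActivityDate, format = "%m/%d/%Y")

# Join on Id only: every sleep row meets every activity row of the same Id
by_id <- merge(sleep_toy, activity_toy, by = "Id")

# Join on Id and Date: one row per participant and day
by_id_date <- merge(sleep_toy, activity_toy, by = c("Id", "Date"))

nrow(by_id)
nrow(by_id_date)
```

The second join keeps the sleep and activity values of the same day side by side, so a scatter plot of the result shows each day once.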
dailyCalories_merged <- read.csv('dailyCalories_merged.csv')
ggplot(data = sleep_activity_merged, aes(x = TotalSteps, y = Calories)) +
  geom_point(color = '#5e040e', fill = '#d1e9f0') +
  geom_smooth() +
  labs(title = "TotalSteps vs Calories")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
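The scatter plot suggests that calories rise with steps. One way to put a number on that is a correlation coefficient. The sketch below uses only the first six rows of daily_raw, copied from the head() output, so its result says nothing about the full data set; on the full data the same call would take daily_raw$TotalSteps and daily_raw$Calories. The variable names are my own.

```r
# First six rows of daily_raw, copied from the head() output in this report
steps    <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728)

# Pearson measures linear association, Spearman measures agreement of ranks
pearson  <- cor(steps, calories)
spearman <- cor(steps, calories, method = "spearman")

round(c(pearson = pearson, spearman = spearman), 3)
```

A coefficient describes association only; it does not show that walking more is the cause of the higher calorie count.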
ggplot(data = sleep_activity_merged) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories, color = SedentaryMinutes))
ggplot(data = sleep_activity_merged) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories, color = SedentaryMinutes)) +
  facet_wrap(~Id)
ggplot(data = sleep_raw, aes(x = TotalMinutesAsleep, y = TotalTimeInBed)) +
  geom_point(color = '#05ad91') +
  geom_smooth() +
  labs(title = "Minutes slept vs Time in bed")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = sleep_raw) +
  geom_point(mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed)) +
  labs(title = "Minutes slept vs Time in bed - Per ID") +
  facet_wrap(~Id)
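The two sleep columns can also be combined into a single figure per night. The sketch below derives the minutes spent awake in bed and the share of bed time spent asleep from the first six rows of sleep_raw, copied from the head() output. The names awake_in_bed and efficiency are my own and are not columns of the data set.

```r
# First six rows of sleep_raw, copied from the head() output in this report
asleep <- c(327, 384, 412, 340, 700, 304)   # TotalMinutesAsleep
in_bed <- c(346, 407, 442, 367, 712, 320)   # TotalTimeInBed

# Minutes in bed without sleeping, and share of bed time spent asleep
awake_in_bed <- in_bed - asleep
efficiency   <- asleep / in_bed

data.frame(asleep, in_bed, awake_in_bed, efficiency = round(efficiency, 3))
```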
ggplot(data = sleep_activity_merged, aes(x = TotalDistance, y = Calories)) +
  geom_point(color = '#960B2E') +
  labs(title = "TotalDistance vs. Calories", x = "Total Distance", y = "Calories") +
  stat_smooth() +
  theme(plot.title = element_text(colour = 'Brown', hjust = .5))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
view(sleep_activity_merged)
n_distinct(sleep_activity_merged$Id)
## [1] 24
hourlyIntensities_raw$ActivityHour <- as.POSIXct(hourlyIntensities_raw$ActivityHour,
                                                 format = "%m/%d/%Y %I:%M:%S %p",
                                                 tz = Sys.timezone())
# extracting time
hourlyIntensities_raw$Time <- format(hourlyIntensities_raw$ActivityHour, format = "%H:%M:%S")
# grouping by time and summarizing mean
hourlyIntensities_mean <- hourlyIntensities_raw %>%
  group_by(Time) %>%
  drop_na() %>%
  summarise(total_mean = mean(TotalIntensity))
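The timestamp conversion above can be checked on a few sample strings before it is trusted on all 22099 rows. The sketch below parses three values in the same 12-hour format and extracts the time of day. It fixes the time zone to UTC, which is my assumption since the source does not state the recording zone; a fixed zone avoids NA results for local times that do not exist because of a daylight-saving change. Parsing of AM and PM through %p also depends on the session locale.

```r
# Sample timestamps in the format used by hourlyIntensities_merged.csv
stamps <- c("4/12/2016 12:00:00 AM", "5/12/2016 12:00:00 PM", "5/12/2016 2:00:00 PM")

# %I is the 12-hour clock and %p the AM/PM marker; tz = "UTC" is an assumption
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Keep only the time of day as a 24-hour string, as done for the Time column
time_of_day <- format(parsed, format = "%H:%M:%S")

# Any NA here would mean a string did not match the format
sum(is.na(parsed))
time_of_day
```

Midnight (12:00:00 AM) should come back as 00:00:00 and 2:00:00 PM as 14:00:00, which confirms that the 24 groups used for the hourly mean run from 00 to 23.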
# Visualizing using bar chart to find the trend
ggplot(data = hourlyIntensities_mean) +
  geom_bar(mapping = aes(x = Time, y = total_mean), stat = 'identity', fill = '#960B2E') +
  theme(
    axis.text.x = element_text(angle = 45),
    axis.line = element_line(size = .5, colour = '#87AED2', linetype = 5),
    axis.text = element_text(angle = 90, color = "#191970", size = 10, face = 2),
    plot.title = element_text(hjust = .5)
  ) +
  labs(title = "Average Intensity VS Time", x = 'Time', y = 'Total Mean')