This is my version of the Google Data Analytics Case Study 2, a case study of a ‘wellness technology’ company.
Bellabeat is a high-tech company that manufactures health-focused smart products. The course presents this case study as another ‘tangible’ way to demonstrate my knowledge and skills, so here we are:
I will be performing many real-world tasks of a junior data analyst working on the marketing analytics team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market.
I joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how I, as a junior data analyst, can help Bellabeat achieve them.
Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team, along with my high-level recommendations for Bellabeat’s marketing strategy.
Sršen asks me to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants me to select one Bellabeat product to apply these insights to in my presentation.
About the company:
Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
Deliverable
I’ll be using a public dataset to analyze and identify trends.
Reliability: This data is not reliable. No further information is provided (for example, margin of error or smart device type), and the sample size is small, which limits the analysis that can be done.
Originality: This is not an original dataset; it is third-party data collected through Amazon Mechanical Turk.
Comprehensiveness: This data is not comprehensive. There is no further information about the participants, such as gender, age, or health state. If the data is biased, the insights from the analysis will be skewed and of little use.
Current: This data was collected in 2016, so it is now outdated.
Cited: The dataset was collected through Amazon Mechanical Turk, but we have no information on whether this is a credible source.
The dataset clearly does not meet the ROCCC criteria. Therefore, insights from the analysis can only provide general direction.
R is widely used for statistical analysis and data visualization, so I chose RStudio to clean and merge the relevant datasets for further analysis.
# install.packages("tidyverse")
# install.packages("lubridate")
# install.packages("ggplot2")
# install.packages("janitor")
library(tidyverse)
library(lubridate)
library(ggplot2)
library(janitor)
setwd("D:/Case_Study/Data/Bellabeat/Fitabase Data 4.12.16-5.12.16")
hour_cal <- read.csv("hourlyCalories_merged.csv")
hour_inten <- read.csv("hourlyIntensities_merged.csv")
hour_steps <- read.csv("hourlySteps_merged.csv")
min_cal <- read.csv("minuteCaloriesNarrow_merged.csv")
min_cal_wide <- read.csv("minuteCaloriesWide_merged.csv")
min_inten <- read.csv("minuteIntensitiesNarrow_merged.csv")
min_inten_wide <- read.csv("minuteIntensitiesWide_merged.csv")
min_mets <- read.csv("minuteMetsNarrow_merged.csv")
min_sleep <- read.csv("minuteSleep_merged.csv")
min_steps <- read.csv("minuteStepsNarrow_merged.csv")
min_steps_wide <- read.csv("minuteStepsWide_merged.csv")
daily_sleep <- read.csv("sleepDay_merged.csv")
daily_act <- read.csv("dailyActivity_merged.csv")
daily_cal <- read.csv("dailyCalories_merged.csv")
daily_inten <- read.csv("dailyIntensities_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")
heart_sec <- read.csv("heartrate_seconds_merged.csv")
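The small-sample concern raised in the ROCCC notes can be checked directly once the files are loaded. The sketch below uses base R and two made-up stand-in tables; `count_participants` is a hypothetical helper, and the `Id` column name is an assumption about the raw CSV headers before `clean_names()` is applied.

```r
# Toy stand-ins for two loaded tables (values are made up for illustration).
daily_act_demo <- data.frame(Id = c(1, 1, 2, 3, 3),
                             TotalSteps = c(100, 200, 150, 0, 300))
weight_log_demo <- data.frame(Id = c(1, 3),
                              WeightKg = c(60, 72))

# Hypothetical helper: number of distinct participants in a table.
count_participants <- function(df, id_col = "Id") {
  length(unique(df[[id_col]]))
}

# Apply the helper to every table of interest and compare coverage.
tables <- list(daily_act = daily_act_demo, weight_log = weight_log_demo)
participants <- sapply(tables, count_participants)
print(participants)
```

Running the same helper over the real tables would show which files cover all participants and which cover only a few.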
daily_sleep <- daily_sleep %>%
  clean_names() %>%
  rename(act_date = sleep_day,
         sleep_min = total_minutes_asleep,
         inbed_min = total_time_in_bed) %>%
  distinct()
# %p (AM/PM) pairs with the 12-hour %I rather than the 24-hour %H; only the date part is kept
daily_sleep$act_date <- as.Date(daily_sleep$act_date, format = "%m/%d/%Y %I:%M:%S %p")
str(daily_sleep)
## 'data.frame': 410 obs. of 5 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ act_date : Date, format: "2016-04-12" "2016-04-13" ...
## $ total_sleep_records: int 1 2 1 2 1 1 1 1 1 1 ...
## $ sleep_min : int 327 384 412 340 700 304 360 325 361 430 ...
## $ inbed_min : int 346 407 442 367 712 320 377 364 384 449 ...
summary(daily_sleep)
## id act_date total_sleep_records sleep_min
## Min. :1.504e+09 Min. :2016-04-12 Min. :1.00 Min. : 58.0
## 1st Qu.:3.977e+09 1st Qu.:2016-04-19 1st Qu.:1.00 1st Qu.:361.0
## Median :4.703e+09 Median :2016-04-27 Median :1.00 Median :432.5
## Mean :4.995e+09 Mean :2016-04-26 Mean :1.12 Mean :419.2
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:1.00 3rd Qu.:490.0
## Max. :8.792e+09 Max. :2016-05-12 Max. :3.00 Max. :796.0
## inbed_min
## Min. : 61.0
## 1st Qu.:403.8
## Median :463.0
## Mean :458.5
## 3rd Qu.:526.0
## Max. :961.0
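Because `distinct()` drops repeated rows silently, it can help to count them first. A minimal base R sketch on made-up rows, where `unique()` on a data frame plays the role of `distinct()`; the demo values are illustrative, not taken from the dataset:

```r
# Made-up sleep log in which the second row repeats the first.
sleep_demo <- data.frame(
  id = c(1, 1, 1, 2),
  sleep_day = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
                "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  total_minutes_asleep = c(327, 327, 384, 412)
)

# duplicated() flags every row that exactly repeats an earlier row.
n_dupes <- sum(duplicated(sleep_demo))

# unique() keeps the first occurrence of each row.
sleep_clean <- unique(sleep_demo)

cat("duplicate rows:", n_dupes, "\n")
cat("rows kept:", nrow(sleep_clean), "\n")
```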
daily_act <- daily_act %>%
  clean_names() %>%
  rename(act_date = activity_date) %>%
  mutate(act_date = as.Date(act_date, format = "%m/%d/%Y")) %>%
  distinct()
str(daily_act)
## 'data.frame': 940 obs. of 15 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ act_date : Date, format: "2016-04-12" "2016-04-13" ...
## $ total_steps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ total_distance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ tracker_distance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ logged_activities_distance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ very_active_distance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ moderately_active_distance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ light_active_distance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ sedentary_active_distance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ very_active_minutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ fairly_active_minutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ lightly_active_minutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentary_minutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
summary(daily_act)
## id act_date total_steps total_distance
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Median :2016-04-26 Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean :2016-04-26 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :2016-05-12 Max. :36019 Max. :28.030
## tracker_distance logged_activities_distance very_active_distance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## moderately_active_distance light_active_distance sedentary_active_distance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## very_active_minutes fairly_active_minutes lightly_active_minutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
## Median : 4.00 Median : 6.00 Median :199.0
## Mean : 21.16 Mean : 13.56 Mean :192.8
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
## Max. :210.00 Max. :143.00 Max. :518.0
## sedentary_minutes calories
## Min. : 0.0 Min. : 0
## 1st Qu.: 729.8 1st Qu.:1828
## Median :1057.5 Median :2134
## Mean : 991.2 Mean :2304
## 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :1440.0 Max. :4900
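`as.Date()` returns `NA` without an error when a string does not match the given format, so a quick check after parsing is worthwhile. A small sketch with made-up strings (base R only):

```r
# Strings in the month/day/year layout used above; trailing time text is ignored.
raw <- c("4/12/2016", "4/13/2016 12:00:00 AM")
parsed <- as.Date(raw, format = "%m/%d/%Y")
print(parsed)

# A string in a different layout fails silently and becomes NA.
bad <- as.Date("2016-04-12", format = "%m/%d/%Y")
print(is.na(bad))

# Guard that could follow any date conversion in the cleaning steps.
stopifnot(!anyNA(parsed))
```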
daily_cal <- daily_cal %>%
  clean_names() %>%
  rename(act_date = activity_day) %>%
  mutate(act_date = as.Date(act_date, format = "%m/%d/%Y")) %>%
  distinct()
str(daily_cal)
## 'data.frame': 940 obs. of 3 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ act_date: Date, format: "2016-04-12" "2016-04-13" ...
## $ calories: int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
summary(daily_cal)
## id act_date calories
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.:1828
## Median :4.445e+09 Median :2016-04-26 Median :2134
## Mean :4.855e+09 Mean :2016-04-26 Mean :2304
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:2793
## Max. :8.878e+09 Max. :2016-05-12 Max. :4900
daily_inten <- daily_inten %>%
  clean_names() %>%
  rename(act_date = activity_day) %>%
  mutate(act_date = as.Date(act_date, format = "%m/%d/%Y")) %>%
  distinct()
str(daily_inten)
## 'data.frame': 940 obs. of 10 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ act_date : Date, format: "2016-04-12" "2016-04-13" ...
## $ sedentary_minutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ lightly_active_minutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ fairly_active_minutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ very_active_minutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ sedentary_active_distance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ light_active_distance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ moderately_active_distance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ very_active_distance : num 1.88 1.57 2.44 2.14 2.71 ...
summary(daily_inten)
## id act_date sedentary_minutes
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0.0
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 729.8
## Median :4.445e+09 Median :2016-04-26 Median :1057.5
## Mean :4.855e+09 Mean :2016-04-26 Mean : 991.2
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:1229.5
## Max. :8.878e+09 Max. :2016-05-12 Max. :1440.0
## lightly_active_minutes fairly_active_minutes very_active_minutes
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:127.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median :199.0 Median : 6.00 Median : 4.00
## Mean :192.8 Mean : 13.56 Mean : 21.16
## 3rd Qu.:264.0 3rd Qu.: 19.00 3rd Qu.: 32.00
## Max. :518.0 Max. :143.00 Max. :210.00
## sedentary_active_distance light_active_distance moderately_active_distance
## Min. :0.000000 Min. : 0.000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.: 1.945 1st Qu.:0.0000
## Median :0.000000 Median : 3.365 Median :0.2400
## Mean :0.001606 Mean : 3.341 Mean :0.5675
## 3rd Qu.:0.000000 3rd Qu.: 4.782 3rd Qu.:0.8000
## Max. :0.110000 Max. :10.710 Max. :6.4800
## very_active_distance
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.210
## Mean : 1.503
## 3rd Qu.: 2.053
## Max. :21.920
daily_steps <- daily_steps %>%
  clean_names() %>%
  rename(act_date = activity_day) %>%
  mutate(act_date = as.Date(act_date, format = "%m/%d/%Y")) %>%
  distinct()
str(daily_steps)
## 'data.frame': 940 obs. of 3 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ act_date : Date, format: "2016-04-12" "2016-04-13" ...
## $ step_total: int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
summary(daily_steps)
## id act_date step_total
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 3790
## Median :4.445e+09 Median :2016-04-26 Median : 7406
## Mean :4.855e+09 Mean :2016-04-26 Mean : 7638
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:10727
## Max. :8.878e+09 Max. :2016-05-12 Max. :36019
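The summaries of daily_cal, daily_inten and daily_steps match the corresponding columns of daily_act, which suggests the smaller tables are already contained in it. One way to confirm that is sketched below in base R, using made-up rows that mimic the cleaned column names:

```r
# Made-up rows with the cleaned column names used above.
keys <- data.frame(
  id = c(1, 1, 2),
  act_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12"))
)
daily_act_demo <- cbind(keys, total_steps = c(13162, 10735, 0),
                        calories = c(1985, 1797, 1500))
daily_steps_demo <- cbind(keys, step_total = c(13162, 10735, 0))

# Inner join on participant and day, then compare the two step columns.
merged <- merge(daily_act_demo, daily_steps_demo, by = c("id", "act_date"))
steps_redundant <- nrow(merged) == nrow(daily_act_demo) &&
  all(merged$total_steps == merged$step_total)
print(steps_redundant)
```

If the check holds on the real tables, daily_act alone is enough for the daily analysis.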
min_cal <- min_cal %>%
  clean_names() %>%
  rename(act_date = activity_minute) %>%
  distinct()
min_cal$date <- mdy_hms(min_cal$act_date)
min_cal$time <- format(as.POSIXct(min_cal$date), format = "%H:%M %p")
min_cal <- min_cal %>%
  select(-c(act_date))
str(min_cal)
## 'data.frame': 1325580 obs. of 4 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ calories: num 0.786 0.786 0.786 0.786 0.786 ...
## $ date : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
## $ time : chr "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...
summary(min_cal)
## id calories date
## Min. :1.504e+09 Min. : 0.0000 Min. :2016-04-12 00:00:00.00
## 1st Qu.:2.320e+09 1st Qu.: 0.9357 1st Qu.:2016-04-19 01:51:00.00
## Median :4.445e+09 Median : 1.2176 Median :2016-04-26 06:27:00.00
## Mean :4.848e+09 Mean : 1.6231 Mean :2016-04-26 12:09:55.15
## 3rd Qu.:6.962e+09 3rd Qu.: 1.4327 3rd Qu.:2016-05-03 18:55:00.00
## Max. :8.878e+09 Max. :19.7499 Max. :2016-05-12 15:59:00.00
## time
## Length:1325580
## Class :character
## Mode :character
##
##
##
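The `time` column shows values such as "00:00 AM" because the format string combines the 24-hour `%H` with the AM/PM marker `%p`. The sketch below compares three format strings on two made-up timestamps. The exact AM/PM text depends on the locale, so only the 24-hour result is fully predictable.

```r
# Two made-up timestamps: midnight and early afternoon.
ts <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 13:05:00"), tz = "UTC")

# Format used in the cleaning steps: 24-hour clock plus an AM/PM marker.
as_in_text <- format(ts, "%H:%M %p")

# Plain 24-hour clock; sorts correctly as text.
clock_24h <- format(ts, "%H:%M")

# 12-hour clock, the usual partner of %p (AM/PM text is locale-dependent).
clock_12h <- format(ts, "%I:%M %p")

print(data.frame(as_in_text, clock_24h, clock_12h))
```

For grouping or sorting by time of day, the plain 24-hour form is the simplest choice.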
min_inten <- min_inten %>%
  clean_names() %>%
  rename(act_date = activity_minute) %>%
  distinct()
min_inten$date <- mdy_hms(min_inten$act_date)
min_inten$time <- format(as.POSIXct(min_inten$date), format = "%H:%M %p")
min_inten <- min_inten %>%
  select(-c(act_date))
str(min_inten)
## 'data.frame': 1325580 obs. of 4 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ intensity: int 0 0 0 0 0 0 0 0 0 0 ...
## $ date : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
## $ time : chr "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...
summary(min_inten)
## id intensity date
## Min. :1.504e+09 Min. :0.0000 Min. :2016-04-12 00:00:00.00
## 1st Qu.:2.320e+09 1st Qu.:0.0000 1st Qu.:2016-04-19 01:51:00.00
## Median :4.445e+09 Median :0.0000 Median :2016-04-26 06:27:00.00
## Mean :4.848e+09 Mean :0.2006 Mean :2016-04-26 12:09:55.15
## 3rd Qu.:6.962e+09 3rd Qu.:0.0000 3rd Qu.:2016-05-03 18:55:00.00
## Max. :8.878e+09 Max. :3.0000 Max. :2016-05-12 15:59:00.00
## time
## Length:1325580
## Class :character
## Mode :character
##
##
##
min_mets <- min_mets %>%
  clean_names() %>%
  rename(act_date = activity_minute) %>%
  rename(mets = me_ts) %>%
  distinct()
min_mets$date <- mdy_hms(min_mets$act_date)
min_mets$time <- format(as.POSIXct(min_mets$date), format = "%H:%M %p")
min_mets <- min_mets %>%
  select(-c(act_date))
str(min_mets)
## 'data.frame': 1325580 obs. of 4 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ mets: int 10 10 10 10 10 12 12 12 12 12 ...
## $ date: POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
## $ time: chr "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...
summary(min_mets)
## id mets date
## Min. :1.504e+09 Min. : 0.00 Min. :2016-04-12 00:00:00.00
## 1st Qu.:2.320e+09 1st Qu.: 10.00 1st Qu.:2016-04-19 01:51:00.00
## Median :4.445e+09 Median : 10.00 Median :2016-04-26 06:27:00.00
## Mean :4.848e+09 Mean : 14.69 Mean :2016-04-26 12:09:55.15
## 3rd Qu.:6.962e+09 3rd Qu.: 11.00 3rd Qu.:2016-05-03 18:55:00.00
## Max. :8.878e+09 Max. :157.00 Max. :2016-05-12 15:59:00.00
## time
## Length:1325580
## Class :character
## Mode :character
##
##
##
min_sleep <- min_sleep %>%
  clean_names() %>%
  rename(act_date = date) %>%
  distinct()
min_sleep$date <- mdy_hms(min_sleep$act_date)
min_sleep$time <- format(as.POSIXct(min_sleep$date), format = "%H:%M %p")
min_sleep <- min_sleep %>%
  select(-c(act_date))
str(min_sleep)
## 'data.frame': 187978 obs. of 5 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ value : int 3 2 1 1 1 1 1 2 2 2 ...
## $ log_id: num 1.14e+10 1.14e+10 1.14e+10 1.14e+10 1.14e+10 ...
## $ date : POSIXct, format: "2016-04-12 02:47:30" "2016-04-12 02:48:30" ...
## $ time : chr "02:47 AM" "02:48 AM" "02:49 AM" "02:50 AM" ...
summary(min_sleep)
## id value log_id
## Min. :1.504e+09 Min. :1.000 Min. :1.137e+10
## 1st Qu.:3.977e+09 1st Qu.:1.000 1st Qu.:1.144e+10
## Median :4.703e+09 Median :1.000 Median :1.150e+10
## Mean :4.997e+09 Mean :1.096 Mean :1.150e+10
## 3rd Qu.:6.962e+09 3rd Qu.:1.000 3rd Qu.:1.155e+10
## Max. :8.792e+09 Max. :3.000 Max. :1.162e+10
## date time
## Min. :2016-04-11 20:48:00.00 Length:187978
## 1st Qu.:2016-04-19 02:48:00.00 Class :character
## Median :2016-04-26 21:48:00.00 Mode :character
## Mean :2016-04-26 13:31:23.11
## 3rd Qu.:2016-05-03 23:47:00.00
## Max. :2016-05-12 09:56:00.00
rm(min_sleep)
min_steps <- min_steps %>%
  clean_names() %>%
  rename(act_date = activity_minute) %>%
  distinct()
min_steps$date <- mdy_hms(min_steps$act_date)
min_steps$time <- format(as.POSIXct(min_steps$date), format = "%H:%M %p")
min_steps <- min_steps %>%
  select(-c(act_date))
str(min_steps)
## 'data.frame': 1325580 obs. of 4 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ steps: int 0 0 0 0 0 0 0 0 0 0 ...
## $ date : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
## $ time : chr "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...
summary(min_steps)
## id steps date
## Min. :1.504e+09 Min. : 0.000 Min. :2016-04-12 00:00:00.00
## 1st Qu.:2.320e+09 1st Qu.: 0.000 1st Qu.:2016-04-19 01:51:00.00
## Median :4.445e+09 Median : 0.000 Median :2016-04-26 06:27:00.00
## Mean :4.848e+09 Mean : 5.336 Mean :2016-04-26 12:09:55.15
## 3rd Qu.:6.962e+09 3rd Qu.: 0.000 3rd Qu.:2016-05-03 18:55:00.00
## Max. :8.878e+09 Max. :220.000 Max. :2016-05-12 15:59:00.00
## time
## Length:1325580
## Class :character
## Mode :character
##
##
##
heart_sec <- heart_sec %>%
  clean_names() %>%
  rename(act_date = time, bpm = value) %>%
  distinct()
heart_sec$date <- mdy_hms(heart_sec$act_date)
heart_sec$time <- format(as.POSIXct(heart_sec$date), format = "%H:%M:%S %p")
heart_sec <- heart_sec %>%
  select(-c(act_date))
str(heart_sec)
## 'data.frame': 2483658 obs. of 4 variables:
## $ id : num 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ bpm : int 97 102 105 103 101 95 91 93 94 93 ...
## $ date: POSIXct, format: "2016-04-12 07:21:00" "2016-04-12 07:21:05" ...
## $ time: chr "07:21:00 AM" "07:21:05 AM" "07:21:10 AM" "07:21:20 AM" ...
summary(heart_sec)
## id bpm date
## Min. :2.022e+09 Min. : 36.00 Min. :2016-04-12 00:00:00.00
## 1st Qu.:4.388e+09 1st Qu.: 63.00 1st Qu.:2016-04-19 06:18:10.00
## Median :5.554e+09 Median : 73.00 Median :2016-04-26 20:28:50.00
## Mean :5.514e+09 Mean : 77.33 Mean :2016-04-26 19:43:52.24
## 3rd Qu.:6.962e+09 3rd Qu.: 88.00 3rd Qu.:2016-05-04 08:00:20.00
## Max. :8.878e+09 Max. :203.00 Max. :2016-05-12 16:20:00.00
## time
## Length:2483658
## Class :character
## Mode :character
##
##
##
hour_steps <- hour_steps %>%
  clean_names() %>%
  rename(act_date = activity_hour) %>%
  distinct()
hour_steps$date <- mdy_hms(hour_steps$act_date)
hour_steps$time <- format(as.POSIXct(hour_steps$date), format = "%H:%M %p")
hour_steps <- hour_steps %>%
  select(-c(act_date))
str(hour_steps)
## 'data.frame': 22099 obs. of 4 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ step_total: int 373 160 151 0 0 0 0 0 250 1864 ...
## $ date : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
## $ time : chr "00:00 AM" "01:00 AM" "02:00 AM" "03:00 AM" ...
summary(hour_steps)
## id step_total date
## Min. :1.504e+09 Min. : 0.0 Min. :2016-04-12 00:00:00.00
## 1st Qu.:2.320e+09 1st Qu.: 0.0 1st Qu.:2016-04-19 01:00:00.00
## Median :4.445e+09 Median : 40.0 Median :2016-04-26 06:00:00.00
## Mean :4.848e+09 Mean : 320.2 Mean :2016-04-26 11:46:42.58
## 3rd Qu.:6.962e+09 3rd Qu.: 357.0 3rd Qu.:2016-05-03 19:00:00.00
## Max. :8.878e+09 Max. :10554.0 Max. :2016-05-12 15:00:00.00
## time
## Length:22099
## Class :character
## Mode :character
##
##
##
hour_inten <- hour_inten %>%
  clean_names() %>%
  rename(act_date = activity_hour) %>%
  distinct()
hour_inten$date <- mdy_hms(hour_inten$act_date)
hour_inten$time <- format(as.POSIXct(hour_inten$date), format = "%H:%M %p")
hour_inten <- hour_inten %>%
  select(-c(act_date))
str(hour_inten)
## 'data.frame': 22099 obs. of 5 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ total_intensity : int 20 8 7 0 0 0 0 0 13 30 ...
## $ average_intensity: num 0.333 0.133 0.117 0 0 ...
## $ date : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
## $ time : chr "00:00 AM" "01:00 AM" "02:00 AM" "03:00 AM" ...
summary(hour_inten)
## id total_intensity average_intensity
## Min. :1.504e+09 Min. : 0.00 Min. :0.0000
## 1st Qu.:2.320e+09 1st Qu.: 0.00 1st Qu.:0.0000
## Median :4.445e+09 Median : 3.00 Median :0.0500
## Mean :4.848e+09 Mean : 12.04 Mean :0.2006
## 3rd Qu.:6.962e+09 3rd Qu.: 16.00 3rd Qu.:0.2667
## Max. :8.878e+09 Max. :180.00 Max. :3.0000
## date time
## Min. :2016-04-12 00:00:00.00 Length:22099
## 1st Qu.:2016-04-19 01:00:00.00 Class :character
## Median :2016-04-26 06:00:00.00 Mode :character
## Mean :2016-04-26 11:46:42.58
## 3rd Qu.:2016-05-03 19:00:00.00
## Max. :2016-05-12 15:00:00.00
hour_cal <- hour_cal %>%
  clean_names() %>%
  rename(act_date = activity_hour) %>%
  distinct()
hour_cal$date <- mdy_hms(hour_cal$act_date)
hour_cal$time <- format(as.POSIXct(hour_cal$date), format = "%H:%M %p")
hour_cal <- hour_cal %>%
  select(-c(act_date))
str(hour_cal)
## 'data.frame': 22099 obs. of 4 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ calories: int 81 61 59 47 48 48 48 47 68 141 ...
## $ date : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
## $ time : chr "00:00 AM" "01:00 AM" "02:00 AM" "03:00 AM" ...
summary(hour_cal)
## id calories date
## Min. :1.504e+09 Min. : 42.00 Min. :2016-04-12 00:00:00.00
## 1st Qu.:2.320e+09 1st Qu.: 63.00 1st Qu.:2016-04-19 01:00:00.00
## Median :4.445e+09 Median : 83.00 Median :2016-04-26 06:00:00.00
## Mean :4.848e+09 Mean : 97.39 Mean :2016-04-26 11:46:42.58
## 3rd Qu.:6.962e+09 3rd Qu.:108.00 3rd Qu.:2016-05-03 19:00:00.00
## Max. :8.878e+09 Max. :948.00 Max. :2016-05-12 15:00:00.00
## time
## Length:22099
## Class :character
## Mode :character
##
##
##
hourly_data <- hour_cal %>%
  left_join(hour_inten, by = c("id", "date", "time")) %>%
  left_join(hour_steps, by = c("id", "date", "time")) %>%
  arrange(time) %>%
  distinct()
hourly_data <- hourly_data %>%
  select(-c(average_intensity))
str(hourly_data)
## 'data.frame': 22099 obs. of 6 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ calories : int 81 69 56 60 77 47 82 47 54 54 ...
## $ date : POSIXct, format: "2016-04-12 00:00:00" "2016-04-13 00:00:00" ...
## $ time : chr "00:00 AM" "00:00 AM" "00:00 AM" "00:00 AM" ...
## $ total_intensity: int 20 14 4 6 15 0 21 0 2 2 ...
## $ step_total : int 373 144 81 83 459 0 416 0 16 17 ...
summary(hourly_data)
## id calories date
## Min. :1.504e+09 Min. : 42.00 Min. :2016-04-12 00:00:00.00
## 1st Qu.:2.320e+09 1st Qu.: 63.00 1st Qu.:2016-04-19 01:00:00.00
## Median :4.445e+09 Median : 83.00 Median :2016-04-26 06:00:00.00
## Mean :4.848e+09 Mean : 97.39 Mean :2016-04-26 11:46:42.58
## 3rd Qu.:6.962e+09 3rd Qu.:108.00 3rd Qu.:2016-05-03 19:00:00.00
## Max. :8.878e+09 Max. :948.00 Max. :2016-05-12 15:00:00.00
## time total_intensity step_total
## Length:22099 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 0.00 1st Qu.: 0.0
## Mode :character Median : 3.00 Median : 40.0
## Mean : 12.04 Mean : 320.2
## 3rd Qu.: 16.00 3rd Qu.: 357.0
## Max. :180.00 Max. :10554.0
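The joined table keeps the 22,099 rows of hour_cal, which is what a left join gives when each participant and hour appears only once in every input. That assumption can be tested before joining. The sketch below uses base R `merge()` in place of `left_join()` on made-up rows, and `has_unique_keys` is a hypothetical helper.

```r
# Made-up hourly tables; the steps table lacks one hour on purpose.
stamps <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 01:00:00",
                       "2016-04-12 00:00:00"), tz = "UTC")
hour_cal_demo <- data.frame(id = c(1, 1, 2), date = stamps,
                            calories = c(81, 61, 70))
hour_steps_demo <- data.frame(id = c(1, 2), date = stamps[c(1, 3)],
                              step_total = c(373, 0))

# Hypothetical helper: TRUE when no key combination is repeated.
has_unique_keys <- function(df, keys) anyDuplicated(df[keys]) == 0

keys <- c("id", "date")
keys_ok <- has_unique_keys(hour_cal_demo, keys) &&
  has_unique_keys(hour_steps_demo, keys)

# all.x = TRUE keeps every left-hand row, like a left join.
joined <- merge(hour_cal_demo, hour_steps_demo, by = keys, all.x = TRUE)

cat("keys unique:", keys_ok, "\n")
cat("rows before/after:", nrow(hour_cal_demo), "/", nrow(joined), "\n")
cat("hours without step data:", sum(is.na(joined$step_total)), "\n")
```

A row count that grows after the join would point to duplicated keys, and `NA` values in the joined columns show hours missing from the right-hand table.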
minute_data <- min_mets %>%
  left_join(min_cal, by = c("id", "date", "time")) %>%
  left_join(min_inten, by = c("id", "date", "time")) %>%
  left_join(min_steps, by = c("id", "date", "time")) %>%
  distinct()
str(minute_data)
## 'data.frame': 1325580 obs. of 7 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ mets : int 10 10 10 10 10 12 12 12 12 12 ...
## $ date : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
## $ time : chr "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...
## $ calories : num 0.786 0.786 0.786 0.786 0.786 ...
## $ intensity: int 0 0 0 0 0 0 0 0 0 0 ...
## $ steps : int 0 0 0 0 0 0 0 0 0 0 ...
summary(minute_data)
## id mets date
## Min. :1.504e+09 Min. : 0.00 Min. :2016-04-12 00:00:00.00
## 1st Qu.:2.320e+09 1st Qu.: 10.00 1st Qu.:2016-04-19 01:51:00.00
## Median :4.445e+09 Median : 10.00 Median :2016-04-26 06:27:00.00
## Mean :4.848e+09 Mean : 14.69 Mean :2016-04-26 12:09:55.15
## 3rd Qu.:6.962e+09 3rd Qu.: 11.00 3rd Qu.:2016-05-03 18:55:00.00
## Max. :8.878e+09 Max. :157.00 Max. :2016-05-12 15:59:00.00
## time calories intensity steps
## Length:1325580 Min. : 0.0000 Min. :0.0000 Min. : 0.000
## Class :character 1st Qu.: 0.9357 1st Qu.:0.0000 1st Qu.: 0.000
## Mode :character Median : 1.2176 Median :0.0000 Median : 0.000
## Mean : 1.6231 Mean :0.2006 Mean : 5.336
## 3rd Qu.: 1.4327 3rd Qu.:0.0000 3rd Qu.: 0.000
## Max. :19.7499 Max. :3.0000 Max. :220.000
1. daily_act
daily_act <- daily_act %>%
  select(c(id, act_date,
           total_steps,
           sedentary_minutes,
           calories))
str(daily_act)
## 'data.frame': 940 obs. of 5 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ act_date : Date, format: "2016-04-12" "2016-04-13" ...
## $ total_steps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ sedentary_minutes: int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
daily_act %>%
  ggplot(aes(sedentary_minutes, total_steps,
             color = sedentary_minutes)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(x = "Total Sedentary Minutes", y = "Total Steps Taken",
       color = "Sedentary Minutes",
       title = "Relationship Between Daily Sedentary Time and Steps Taken",
       caption = "Data Analyst : JP") +
  # Label rounded from the R-squared computed below (0.107)
  annotate("text", x = 220, y = 30000, label = "R^2 = 0.11", color = "red",
           fontface = "bold", size = 5, angle = 25) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
summary(lm(sedentary_minutes ~ total_steps, daily_act))$r.squared
## [1] 0.1072455
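The plot puts steps on the y-axis while the `lm()` call regresses sedentary minutes on steps. For a regression with a single predictor this makes no difference to R², which equals the squared Pearson correlation whichever variable is treated as the response. A quick illustration on simulated data (the numbers are random, not from the dataset):

```r
set.seed(42)

# Simulated pair of weakly related variables.
x <- rnorm(200)
y <- -0.3 * x + rnorm(200)

# R-squared with y as the response, then with x as the response.
r2_y_on_x <- summary(lm(y ~ x))$r.squared
r2_x_on_y <- summary(lm(x ~ y))$r.squared

# Squared Pearson correlation.
r2_from_cor <- cor(x, y)^2

print(c(r2_y_on_x = r2_y_on_x, r2_x_on_y = r2_x_on_y, r2_from_cor = r2_from_cor))
```

All three values agree, so the R² figures reported with each plot apply regardless of the order of variables in the formula. The slope and intercept, however, do depend on that order.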
daily_act %>%
  ggplot(aes(calories, total_steps, color = sedentary_minutes)) +
  geom_point(size = 2, alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(x = "Calories",
       y = "Total Steps Taken",
       color = "Sedentary Minutes",
       title = "Relationship Between Calories and Steps Taken",
       subtitle = "Linear regression model has a weak fit for this relationship",
       caption = "Data Analyst : JP") +
  # Label rounded from the R-squared computed below (0.350)
  annotate("text", x = 500, y = 30000, label = "R^2 = 0.35", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
summary(lm(calories ~ total_steps , daily_act))$r.squared
## [1] 0.3499528
daily_sleep %>%
  ggplot(aes(sleep_min, inbed_min, color = factor(total_sleep_records))) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#330000") +
  labs(x = "Total Minutes Asleep",
       y = "Total Minutes in Bed",
       color = "Sleep Records",
       title = "Relationship Between Sleep and In-Bed Time",
       subtitle = "Linear regression model has a strong fit for this relationship",
       caption = "Data Analyst : JP") +
  # Label rounded from the R-squared computed below (0.866)
  annotate("text", x = 175, y = 775, label = "R^2 = 0.87", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
summary(lm(sleep_min~inbed_min, daily_sleep))$r.squared
## [1] 0.8656858
daily_sleep_act <- daily_sleep %>%
  left_join(daily_act, by = c("id", "act_date")) %>%
  distinct()
str(daily_sleep_act)
## 'data.frame': 410 obs. of 8 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ act_date : Date, format: "2016-04-12" "2016-04-13" ...
## $ total_sleep_records: int 1 2 1 2 1 1 1 1 1 1 ...
## $ sleep_min : int 327 384 412 340 700 304 360 325 361 430 ...
## $ inbed_min : int 346 407 442 367 712 320 377 364 384 449 ...
## $ total_steps : int 13162 10735 9762 12669 9705 15506 10544 9819 14371 10039 ...
## $ sedentary_minutes : int 728 776 726 773 539 775 818 838 732 709 ...
## $ calories : int 1985 1797 1745 1863 1728 2035 1786 1775 1949 1788 ...
rm(daily_sleep_act)
2. hourly_data
hourly_data %>%
  # This filter only hides extreme hours from the plot; the R-squared below uses all rows
  filter(step_total < 6000) %>%
  ggplot(aes(total_intensity, step_total)) +
  geom_point(color = "blue", size = 2, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Intensity",
       y = "Steps Taken",
       title = "Relationship Between Intensity & Total Steps",
       subtitle = "Linear regression model has a strong fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x = 10, y = 5000, label = "R^2 = 0.80", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
summary(lm(total_intensity~step_total, hourly_data))$r.squared
## [1] 0.8027856
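The plot above drops hours with 6,000 or more steps, while the R² is computed on the full table, so the fitted line and the annotated value describe slightly different data. A sketch of how both figures could be reported side by side, using simulated data with the same column names; `r2` is a hypothetical helper.

```r
set.seed(1)

# Simulated hourly data (not the real dataset): steps rise with intensity.
hourly_demo <- data.frame(total_intensity = runif(300, 0, 180))
hourly_demo$step_total <- pmax(
  0, 40 * hourly_demo$total_intensity + rnorm(300, sd = 600)
)

# Hypothetical helper: R-squared of steps regressed on intensity.
r2 <- function(df) {
  summary(lm(step_total ~ total_intensity, data = df))$r.squared
}

# R-squared on all rows versus on the rows that the plot actually shows.
r2_all <- r2(hourly_demo)
r2_plotted <- r2(subset(hourly_demo, step_total < 6000))

print(c(all_rows = r2_all, plotted_rows = r2_plotted))
```

Annotating the plot with the value computed on the same filtered rows keeps the figure and the statistic consistent.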
hourly_data %>%
  filter(calories < 600) %>%
  ggplot(aes(total_intensity, calories)) +
  geom_point(color = "blue", size = 2, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Intensity",
       y = "Calories",
       title = "Relationship Between Intensity & Calories",
       subtitle = "Linear regression model has a strong fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x = 10, y = 475, label = "R^2 = 0.80", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
summary(lm(total_intensity~calories, hourly_data))$r.squared
## [1] 0.8039204
hourly_data %>%
  filter(calories < 600) %>%
  ggplot(aes(step_total, calories)) +
  geom_point(color = "blue", size = 2, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Steps Taken",
       y = "Calories",
       title = "Relationship Between Steps Taken & Calories",
       subtitle = "Linear regression model has a strong fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x = 6800, y = 150, label = "R^2 = 0.66", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
summary(lm(step_total~calories, hourly_data))$r.squared
## [1] 0.6641728
3. minute_data
minute_data %>%
  ggplot(aes(mets, calories)) +
  # linewidth replaces size for line geoms in ggplot2 3.4.0 and later
  geom_line(color = "blue", linewidth = 0.5, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "METs",
       y = "Calories",
       title = "Relationship Between METs & Calories",
       subtitle = "Linear regression model has a strong fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x = 25, y = 17, label = "R^2 = 0.91", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
summary(lm(mets~calories, minute_data))$r.squared
## [1] 0.9138607
Urška Sršen & the Team
A larger sample size and a longer data collection period are needed for a more in-depth and precise statistical analysis.
Data should be collected from primary or secondary sources to increase the credibility and reliability of the datasets.