About the company: Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products.
Product: Bellabeat app. The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to the company's line of smart wellness products.
BUSINESS PROBLEM: Analyze smart device usage data in order to gain insight into how consumers use these smart devices.
ASK PHASE: This project investigates trends in smart device usage, how these trends apply to Bellabeat customers, and how they could help influence Bellabeat's marketing strategy. Stakeholders: Urška Sršen - Cofounder & Chief Creative Officer of Bellabeat
PREPARE PHASE: The data is available on the Kaggle website. It is stored as CSV files in long format. There are issues with bias and credibility. On one hand, the data gives no information about the demographics or working lifestyle of the respondents, which makes it difficult to base concrete decisions on the analysis. On the other hand, some of the files lack detailed metadata or clearly defined headings that identify their content, which makes those files unusable and unreliable. This limited the usability of some of the datasets.
As regards licensing, privacy, security, and accessibility: the data is licensed for use through Kaggle, and no identifying information about the participants is included in the dataset, which protects the privacy of the users involved.
The data's integrity was verified in R. This included, but was not limited to, validating and cross-checking the columns and structure of each dataset prior to the main analysis. This verification helps to address the business questions by ensuring that only credible datasets were analyzed to support data-driven business decisions.
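As a minimal sketch of the kind of integrity check described above, the helper below reports the row count, distinct ids, missing values, and duplicated rows of a data frame. The function name check_integrity and the toy data are hypothetical and not part of the original analysis; only base R is assumed.

```r
# Hypothetical helper: summarise basic integrity measures for a data frame.
check_integrity <- function(df, id_col = "Id") {
  c(
    rows            = nrow(df),                       # number of observations
    distinct_ids    = length(unique(df[[id_col]])),   # number of distinct users
    missing_values  = sum(is.na(df)),                 # NA cells across all columns
    duplicated_rows = sum(duplicated(df))             # fully identical repeated rows
  )
}

# Toy data (illustrative values only, not taken from the CSV files).
toy <- data.frame(
  Id         = c(1, 1, 2, 2),
  TotalSteps = c(100, 100, NA, 250)
)

integrity_report <- check_integrity(toy)
print(integrity_report)
```

The same four measures are computed one by one for daily_sleep and daily_activity in the sections that follow.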
PROCESS PHASE: For this analysis, only RStudio was used. R has the functions needed to cross-validate and analyze the given datasets. It provides suitable tools for checking for errors, which were then removed prior to analysis. R also allows the cleaning process to be documented properly. Moreover, it allows clear data visuals to be made for the stakeholders and other viewers, while also providing a platform for sharing the analyzed results.
ANALYZE PHASE: Whenever an analysis step produces a notable insight, it is highlighted directly beneath that step, followed by the business insights and recommendations. This helps to ensure that stakeholders have step-by-step business insight at every stage of the analysis without missing any important information. Note: Feel free to read the business insights provided below each analysis as needed.
SHARE PHASE: Several tables and visuals, including bar charts and line plots, were created in the course of the analysis. These give stakeholders brief yet detailed information for understanding the business recommendations provided.
The datasets were downloaded from Kaggle via this link: https://www.kaggle.com/datasets/arashnic/fitbit
There are 18 CSV files in the package; however, after examining them, only two were used for this analysis: dailyActivity_merged.csv and sleepDay_merged.csv. The two datasets were then merged for further analysis.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(dplyr)
library(knitr)
library(ggplot2)
library(ggpubr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(rmarkdown)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
For this project, two datasets were finally selected: 1. sleepDay_merged 2. dailyActivity_merged
daily_sleep <- read_csv("C:\\Users\\personal\\Documents\\BellaBeat\\sleepDay_merged.csv")
## Rows: 413 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(daily_sleep)
## # A tibble: 6 x 5
## Id SleepDay TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:0~ 1 327 346
## 2 1503960366 4/13/2016 12:00:0~ 2 384 407
## 3 1503960366 4/15/2016 12:00:0~ 1 412 442
## 4 1503960366 4/16/2016 12:00:0~ 2 340 367
## 5 1503960366 4/17/2016 12:00:0~ 1 700 712
## 6 1503960366 4/19/2016 12:00:0~ 1 304 320
str(daily_sleep)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
nrow(daily_sleep)
## [1] 413
n_distinct(daily_sleep$Id)
## [1] 24
sum(is.na(daily_sleep))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
There are 24 distinct users (Ids) in the daily_sleep dataset, spread over 413 rows with no missing values and 3 duplicated rows.
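The duplicated rows are counted above but not dropped in this analysis, which may be why the merged table later has 943 rows rather than 940. A minimal sketch of how such rows could be removed is shown below, using base R and a toy stand-in for daily_sleep (the toy values are illustrative, not taken from the CSV).

```r
# Toy stand-in for daily_sleep; row 2 is an exact copy of row 1.
sleep_toy <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 327, 384, 412)
)

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates <- sum(duplicated(sleep_toy))

# Keep only the first copy of each row.
sleep_unique <- sleep_toy[!duplicated(sleep_toy), ]

print(n_duplicates)
print(nrow(sleep_unique))
```

With dplyr loaded, distinct(daily_sleep) would be an equivalent way to drop exact duplicates.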
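# Preview the snake_case column names; the result is not assigned, so daily_sleep keeps its original names
clean_names(daily_sleep)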
## # A tibble: 413 x 5
## id sleep_day total_sleep_rec~ total_minutes_a~ total_time_in_b~
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:0~ 1 327 346
## 2 1503960366 4/13/2016 12:0~ 2 384 407
## 3 1503960366 4/15/2016 12:0~ 1 412 442
## 4 1503960366 4/16/2016 12:0~ 2 340 367
## 5 1503960366 4/17/2016 12:0~ 1 700 712
## 6 1503960366 4/19/2016 12:0~ 1 304 320
## 7 1503960366 4/20/2016 12:0~ 1 360 377
## 8 1503960366 4/21/2016 12:0~ 1 325 364
## 9 1503960366 4/23/2016 12:0~ 1 361 384
## 10 1503960366 4/24/2016 12:0~ 1 430 449
## # ... with 403 more rows
# Parse the timestamp string and keep only the date; as_date() ignores `tz`, so it is omitted
daily_sleep <- daily_sleep %>%
  rename(Date = SleepDay) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y %I:%M:%S %p"))
head(daily_sleep)
## # A tibble: 6 x 5
## Id Date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
daily_sleep %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
The TotalMinutesAsleep column shows that users sleep for about 7 hours per day on average (mean 419.5 minutes, median 433 minutes), which is right around the minimum recommended amount of sleep for an adult. Link here: https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html
Business Insight: The app might be updated to show users their daily hours of sleep, while reminding them of the recommended hours of sleep for a healthy lifestyle.
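The conversion behind the "7 hours" figure can be sketched as follows. The mean is taken from the summary output above; the 7-hour threshold is an assumption based on the commonly cited minimum for adults, and the variable names are hypothetical.

```r
mean_minutes_asleep <- 419.5  # mean of TotalMinutesAsleep from the summary above
recommended_hours   <- 7      # assumed minimum recommendation for adults

# Convert minutes to hours and compare against the assumed threshold.
mean_hours_asleep    <- mean_minutes_asleep / 60
meets_recommendation <- mean_hours_asleep >= recommended_hours

print(round(mean_hours_asleep, 2))
print(meets_recommendation)
```

The mean comes out just under 7 hours, so the average user sits at the edge of the recommendation rather than comfortably inside it.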
ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point(aes(color=Date)) + geom_smooth(mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed), color = "purple") +
labs(title = "TotalMinutesAsleep Vs TotalTimeInBed")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
cor(daily_sleep$TotalMinutesAsleep, daily_sleep$TotalTimeInBed)
## [1] 0.9304575
There is a very strong positive correlation (r = 0.93) between TotalMinutesAsleep and TotalTimeInBed.
Also, TotalMinutesAsleep is less than TotalTimeInBed. This is expected and suggests some level of accuracy in the app, since users usually spend some time in bed before they fall asleep.
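The relationship between the two columns can be checked row by row. The sketch below uses the first six rows shown in head(daily_sleep); the column name MinutesAwakeInBed is hypothetical and introduced only for illustration.

```r
# First six rows of daily_sleep, as printed by head(daily_sleep) above.
sleep_toy <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Time spent in bed but not asleep (hypothetical derived column).
sleep_toy$MinutesAwakeInBed <- sleep_toy$TotalTimeInBed - sleep_toy$TotalMinutesAsleep

# A negative value would mean more sleep than time in bed, i.e. a data error.
all_consistent <- all(sleep_toy$MinutesAwakeInBed >= 0)

print(sleep_toy$MinutesAwakeInBed)
print(all_consistent)
```

Running the same check on the full dataset would confirm whether the pattern holds for all 413 rows.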
The number of people who reported daily sleep information (24 users) is too small for a reliably informative statistical analysis.
Business Insight: Persuading more people to report this information would help to provide detailed information about the relationship between sleep time and healthy living. The company might show customers how providing this information could lead to better use of the device.
daily_activity <- read_csv("C:\\Users\\personal\\Documents\\BellaBeat\\dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(daily_activity)
## # A tibble: 6 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
str(daily_activity)
## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
nrow(daily_activity) #number of rows in the daily_activity dataset
## [1] 940
n_distinct(daily_activity$Id)
## [1] 33
sum(is.na(daily_activity))
## [1] 0
sum(duplicated(daily_activity))
## [1] 0
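# Preview the snake_case column names; the result is not assigned, so daily_activity keeps its original names
clean_names(daily_activity)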
## # A tibble: 940 x 15
## id activity_date total_steps total_distance tracker_distance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## 7 1503960366 4/18/2016 13019 8.59 8.59
## 8 1503960366 4/19/2016 15506 9.88 9.88
## 9 1503960366 4/20/2016 10544 6.68 6.68
## 10 1503960366 4/21/2016 9819 6.34 6.34
## # ... with 930 more rows, and 10 more variables:
## # logged_activities_distance <dbl>, very_active_distance <dbl>,
## # moderately_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>
daily_activity <- daily_activity %>%
rename(Date = ActivityDate) %>%
mutate(Date = as_date(Date, format = "%m/%d/%Y"))
head(daily_activity)
## # A tibble: 6 x 15
## Id Date TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 13162 8.5 8.5 0
## 2 1.50e9 2016-04-13 10735 6.97 6.97 0
## 3 1.50e9 2016-04-14 10460 6.74 6.74 0
## 4 1.50e9 2016-04-15 9762 6.28 6.28 0
## 5 1.50e9 2016-04-16 12669 8.16 8.16 0
## 6 1.50e9 2016-04-17 9705 6.48 6.48 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
daily_activity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :1440.0
ggplot(data=daily_activity) + geom_smooth(mapping = aes(x=TotalSteps, y=Calories), color = "green") + geom_point(mapping = aes(x=TotalSteps, y=Calories), color = "blue") +
labs(title = "TotalSteps Vs Calories")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
cor(daily_activity$TotalSteps, daily_activity$Calories)
## [1] 0.5915681
There is a moderate positive correlation (r = 0.59) between TotalSteps and Calories burnt. There are outliers, however: one record with more than 30,000 steps in a day, and a set of records with 0 steps per day, which might be glitches recorded by the app. Overall, a typical daily record is around 7,400 to 7,600 steps (median 7,406, mean 7,638), somewhat under 8,000 steps per day.
Business Insight: The company might update the app to tell users which step-count category they fall into, for example: “A particular percentage (%) of users take about 8,000 steps per day; you might want to do better today by going for a walk to burn a reasonable amount of calories.” This could be targeted at users who do not walk daily but use the app for other purposes.
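The outliers mentioned above (days with 0 steps and days with more than 30,000 steps) could be flagged before any further analysis. This is a minimal sketch on toy data: the thresholds come from the text, the rows are illustrative apart from the first, and the names activity_flagged and activity_kept are hypothetical.

```r
# Toy stand-in for daily_activity (illustrative values apart from the first row).
activity_toy <- data.frame(
  Id         = c(1, 2, 3, 4),
  TotalSteps = c(13162, 0, 36019, 7406),
  Calories   = c(1985, 1500, 2690, 2100)
)

# Flag days with no steps at all, and days above 30,000 steps.
is_zero_day    <- activity_toy$TotalSteps == 0
is_extreme_day <- activity_toy$TotalSteps > 30000

activity_flagged <- activity_toy[is_zero_day | is_extreme_day, ]
activity_kept    <- activity_toy[!(is_zero_day | is_extreme_day), ]

print(activity_flagged)
print(nrow(activity_kept))
```

Whether flagged days should be excluded is a judgment call; a day with 0 steps may simply be a day on which the device was not worn.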
ggplot(data=daily_activity) +geom_smooth(mapping = aes(x=TotalSteps, y=SedentaryMinutes), color = "green") + geom_point(mapping = aes(x = TotalSteps, y = SedentaryMinutes), color = "blue") +
labs(title = "TotalSteps Vs SedentaryMinutes")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
cor(daily_activity$TotalSteps, daily_activity$SedentaryMinutes)
## [1] -0.3274835
There is a moderate negative correlation (r = -0.33) between TotalSteps and SedentaryMinutes. This is expected, as an increase in the number of steps taken by an individual means a decrease in the time they spend in one spot. The relatively high sedentary minutes recorded (median of about 1,057 minutes per day) also show that most users stay in one spot for much of the day, as against a few others who move around more on a daily basis.
Business Insight: The smart device may be updated to target users who record relatively high sedentary minutes, reminding them to take more steps so as to keep fit and burn calories. This can be another marketing strategy to win more potential customers.
ggplot(data=daily_activity) + geom_smooth(mapping = aes(x = TotalSteps, y = TotalDistance), color = "green") + geom_point(mapping = aes(x=TotalSteps, y=TotalDistance), color = "blue") +
labs(title = "TotalDistance vs TotalSteps")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
cor(daily_activity$TotalDistance, daily_activity$TotalSteps)
## [1] 0.9853688
There is a very strong positive correlation (r = 0.99) between TotalDistance and TotalSteps. This is further evidence of the accuracy of the smart device, since the total distance covered by a user is expected to be closely correlated with the number of steps they take.
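The verbal labels used for the correlations in this report can be made explicit with a small helper. The cutoffs below (0.7 for "strong", 0.3 for "moderate") are an assumed convention, not a standard fixed by the analysis, and the function name describe_correlation is hypothetical.

```r
# Hypothetical helper: turn a correlation coefficient into a verbal label.
# Cutoffs are an assumption: |r| >= 0.7 strong, |r| >= 0.3 moderate, else weak.
describe_correlation <- function(r) {
  strength  <- ifelse(abs(r) >= 0.7, "strong",
               ifelse(abs(r) >= 0.3, "moderate", "weak"))
  direction <- ifelse(r >= 0, "positive", "negative")
  setNames(paste(strength, direction), names(r))
}

# Coefficients reported by the cor() calls above.
reported_r <- c(
  asleep_vs_in_bed   = 0.9304575,
  steps_vs_calories  = 0.5915681,
  steps_vs_sedentary = -0.3274835,
  distance_vs_steps  = 0.9853688
)

correlation_labels <- describe_correlation(reported_r)
print(correlation_labels)
```

Different cutoffs would shift the labels, particularly for the steps and sedentary minutes pair, which sits close to the assumed 0.3 boundary.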
With all = TRUE, merge() performs a full outer join: every row from both datasets is kept, and columns with no match in the other dataset are filled with NA.
combined_data <- merge(daily_activity, daily_sleep, by = c("Id", "Date"), all = TRUE)
glimpse(combined_data)
## Rows: 943
## Columns: 18
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-~
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
## $ TotalSleepRecords <dbl> 1, 2, NA, 1, 2, 1, NA, 1, 1, 1, NA, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, NA, 412, 340, 700, NA, 304, 360, 32~
## $ TotalTimeInBed <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377, 36~
n_distinct(combined_data$Id)
## [1] 33
sum(is.na(combined_data))
## [1] 1590
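The 1590 missing values reported above are what a full outer join would produce if 530 activity rows (943 - 413) had no sleep record, since each such row gets NA in the three sleep columns. This assumes every sleep row matched an activity row. The toy example below sketches the mechanism in base R; the data frames are illustrative.

```r
# Toy activity data: three user-days.
activity_toy <- data.frame(
  Id         = c(1, 1, 2),
  Date       = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 9000)
)

# Toy sleep data: only one of those user-days has a sleep record.
sleep_toy <- data.frame(
  Id                 = 1,
  Date               = as.Date("2016-04-12"),
  TotalMinutesAsleep = 327
)

# Full outer join: unmatched activity rows get NA in the sleep column.
combined_toy <- merge(activity_toy, sleep_toy, by = c("Id", "Date"), all = TRUE)

print(nrow(combined_toy))
print(sum(is.na(combined_toy)))

# Arithmetic check for the real data: unmatched rows times sleep columns.
expected_missing <- (943 - 413) * 3
print(expected_missing)
```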
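# Replace missing values in numeric columns with 0 (days without a sleep record then show 0 sleep minutes)
combined_data <- combined_data %>%
  mutate_if(is.numeric, ~replace(., is.na(.), 0))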
sum(is.na(combined_data))
## [1] 0
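Replacing NA with 0, as done above, changes any average later computed on the sleep columns, because days without a sleep record are then counted as days with zero sleep. The sketch below illustrates the effect on the first four TotalMinutesAsleep values shown in the glimpse() output; it is an illustration, not part of the original analysis.

```r
# First four TotalMinutesAsleep values after the merge (from glimpse above).
minutes_asleep <- c(327, 384, NA, 412)

# Zero-filling treats the missing day as a night with no sleep at all.
zero_filled <- replace(minutes_asleep, is.na(minutes_asleep), 0)

mean_zero_filled <- mean(zero_filled)                 # missing day counted as 0
mean_na_removed  <- mean(minutes_asleep, na.rm = TRUE) # missing day ignored

print(mean_zero_filled)
print(mean_na_removed)
```

For sleep-related summaries, keeping the NA values and using na.rm = TRUE may therefore give a more representative picture.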
# Preview the weekday numbers (0 = Sunday, ..., 6 = Saturday) instead of printing all 943 values
head(format(as.Date(combined_data$Date), "%w"))
## [1] "2" "3" "4" "5" "6" "0"
combined_data$DayOfTheWeek <- format(as.Date(combined_data$Date),"%w")
# Preview the weekday labels instead of printing all 943 values
head(wday(combined_data$Date, label = TRUE))
## [1] Tue Wed Thu Fri Sat Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
str(combined_data)
## 'data.frame': 943 obs. of 19 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ Date : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num 728 776 1218 726 773 ...
## $ Calories : num 1985 1797 1776 1745 1863 ...
## $ TotalSleepRecords : num 1 2 0 1 2 1 0 1 1 1 ...
## $ TotalMinutesAsleep : num 327 384 0 412 340 700 0 304 360 325 ...
## $ TotalTimeInBed : num 346 407 0 442 367 712 0 320 377 364 ...
## $ DayOfTheWeek : chr "2" "3" "4" "5" ...
combined_data$TotalMinutes <- combined_data$VeryActiveMinutes +
  combined_data$FairlyActiveMinutes +
  combined_data$LightlyActiveMinutes +
  combined_data$SedentaryMinutes
str(combined_data)
## 'data.frame': 943 obs. of 20 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ Date : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num 728 776 1218 726 773 ...
## $ Calories : num 1985 1797 1776 1745 1863 ...
## $ TotalSleepRecords : num 1 2 0 1 2 1 0 1 1 1 ...
## $ TotalMinutesAsleep : num 327 384 0 412 340 700 0 304 360 325 ...
## $ TotalTimeInBed : num 346 407 0 442 367 712 0 320 377 364 ...
## $ DayOfTheWeek : chr "2" "3" "4" "5" ...
## $ TotalMinutes : num 1094 1033 1440 998 1040 ...
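# Derive the weekday name from Date (note: '%A' returns names in the current locale)
combined_data$DayOfTheWeek = strftime(combined_data$Date, '%A')
# Order weekdays Monday to Sunday; the full column name is used instead of relying on partial matching of $Day
combined_data$DayOfTheWeek = factor(combined_data$DayOfTheWeek, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))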
ggplot(data = combined_data, aes(x = DayOfTheWeek, fill = DayOfTheWeek)) +
  geom_bar(stat = "count") +
  theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold")) +
  labs(x = 'Day of Week',
       y = 'Frequency of logged-in times',
       title = 'Number of times users logged in app across the week')
The app recorded more logged-in days on Tuesdays, Wednesdays, and Thursdays than on Fridays through Mondays. Note that this chart counts daily records, not steps, and that part of the difference may come from the collection window itself: if the window contains more Tuesdays, Wednesdays, and Thursdays than other days, those days will show more records regardless of user behaviour. Because the dataset lacks information such as the users' occupations and age groups, the report cannot give a detailed picture of why users might be less active, or take fewer steps, on certain days of the week.
One can only assume that users stay indoors at weekends and take more steps at work on weekdays. More detailed demographic information, and data from more users, would provide additional insight for this kind of analysis.
Alternatively, users may drive to places at weekends when they are not at work. If that is the case, then: Business Insight: The app could remind users to take a walk at weekends rather than driving or staying in one spot for longer than usual.
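The bar chart above counts records per weekday, whereas the interpretation concerns steps. A minimal sketch of how mean steps per weekday could be checked directly is given below. It uses a small made-up data frame (`toy`, with illustrative values that are not taken from the dataset) in place of `combined_data`, and derives the weekday from the locale-independent `%u` code rather than `%A`; the names `toy`, `day_levels`, and `steps_by_day` are assumptions introduced for this example.

```r
# Toy stand-in for combined_data (illustrative values, NOT from the dataset)
toy <- data.frame(
  Date = as.Date("2016-04-11") + 0:13,  # two full weeks, starting on a Monday
  TotalSteps = c(8000, 9000, 9500, 9200, 7000, 5000, 4000,
                 8200, 9100, 9700, 9000, 7200, 5200, 4200)
)

# "%u" gives the weekday as 1 (Monday) to 7 (Sunday), independent of locale
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
toy$DayOfTheWeek <- factor(day_levels[as.integer(format(toy$Date, "%u"))],
                           levels = day_levels)

# Mean steps per weekday, returned in Monday-to-Sunday order
steps_by_day <- aggregate(TotalSteps ~ DayOfTheWeek, data = toy, FUN = mean)
print(steps_by_day)
```

Applied to `combined_data`, the same `aggregate()` call would show whether steps really are lower at weekends, which the record counts alone cannot establish.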
combined_data %>%
select(TotalSteps,
TotalDistance,
TotalMinutes,
SedentaryMinutes,
Calories) %>%
summary()
## TotalSteps TotalDistance TotalMinutes SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 2.0 Min. : 0.0
## 1st Qu.: 3795 1st Qu.: 2.620 1st Qu.: 989.5 1st Qu.: 729.0
## Median : 7439 Median : 5.260 Median :1440.0 Median :1057.0
## Mean : 7652 Mean : 5.503 Mean :1218.2 Mean : 990.4
## 3rd Qu.:10734 3rd Qu.: 7.720 3rd Qu.:1440.0 3rd Qu.:1229.0
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :1440.0
## Calories
## Min. : 0
## 1st Qu.:1830
## Median :2140
## Mean :2308
## 3rd Qu.:2796
## Max. :4900
g1= ggplot(data=combined_data)+
geom_point(mapping = aes(x = TotalDistance, y =Calories),color="green")+
labs(title="Total Distance Vs Calories")
cor(combined_data$TotalDistance, combined_data$Calories)
## [1] 0.6466023
g2= ggplot(data=combined_data)+
geom_point(mapping = aes(x = TotalSteps, y =Calories),color="blue")+
labs(title="Total Steps Vs Calories")
cor(combined_data$TotalSteps, combined_data$Calories)
## [1] 0.5929493
ggarrange(g1, g2, ncol = 2, nrow = 1)
There is a moderately strong positive correlation (0.65) between TotalDistance and Calories, and a similar correlation (0.59) between TotalSteps and Calories.
As reported earlier, TotalSteps and TotalDistance are strongly correlated, which explains why the two graphs look alike. Both graphs show that the more steps taken or distance covered, the more calories an individual tends to burn.
Business Insight: The app could present this information to users, telling them that they can stay fit by taking more steps each day.
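A correlation of 0.59 corresponds to an r-squared of about 0.35, meaning that steps account for roughly a third of the variation in calories, and correlation alone does not show that one causes the other. A hedged sketch of how the relationship could be quantified further with `cor.test()` and `lm()` follows. The data frame `toy` is simulated under assumed parameters (the slope, intercept, and noise level are invented for illustration), so its numbers say nothing about the real dataset.

```r
# Simulated stand-in for combined_data (assumed parameters, NOT real data)
set.seed(1)
toy <- data.frame(TotalSteps = round(runif(200, min = 0, max = 20000)))
toy$Calories <- 1700 + 0.08 * toy$TotalSteps + rnorm(200, sd = 500)

# Pearson correlation with a significance test and confidence interval
ct <- cor.test(toy$TotalSteps, toy$Calories)

# Simple linear regression of calories on steps
fit <- lm(Calories ~ TotalSteps, data = toy)

ct$estimate              # correlation coefficient r
ct$conf.int              # 95% confidence interval for r
coef(fit)                # intercept and slope (calories per additional step)
summary(fit)$r.squared   # equals r^2 when there is a single predictor
```

Running the same two calls on `combined_data$TotalSteps` and `combined_data$Calories` would attach a confidence interval to the 0.59 figure and give a slope that could be reported to users.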
ggplot(data = combined_data) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories, color = TotalSteps)) +
  scale_color_gradientn(colours = terrain.colors(12)) +
  # Reference lines at the mean calories (about 2308) and mean steps (about 7652)
  geom_hline(yintercept = mean(combined_data$Calories), color = "purple", size = 0.5) +
  geom_vline(xintercept = mean(combined_data$TotalSteps), color = "black", size = 0.5) +
  # annotate() draws the label once; geom_text() with aes() would redraw it for every row
  annotate("text", x = 9500, y = 2200, label = "Mean", color = "black", size = 5) +
  theme(plot.title = element_text(hjust = 0.2, lineheight = 0.5, face = "bold")) +
  labs(
    x = 'Steps taken',
    y = 'Calories burned',
    title = 'Calories burnt per step taken')
cor(combined_data$TotalSteps, combined_data$Calories)
## [1] 0.5929493
Revised Business Recommendations: To improve the marketing strategy for the Bellabeat app, the company may update the information it provides to users on a daily or weekly basis, using the recommendations given above.
One of these recommendations is to inform users and potential customers that they can track their lifestyles on a daily or weekly basis by using all the resources available in this smart device. Detailed insight into their daily lifestyles could help them focus on the adjustments needed for healthy living. These resources include monitoring stress, the menstrual cycle, daily steps taken, etc. Encouraging users to log in and use every resource of the device for record keeping will enable the company to give them more information about their daily lifestyle.
The company may also need to investigate the device to address the glitch that makes it produce some strange reports or outliers, such as 0 or more than 30,000 TotalSteps in a day.
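One possible way to flag such records before analysis is sketched below, using the common 1.5 × IQR rule for the upper bound; this rule is an assumption of the example, not a method used in the report. With the quartiles reported earlier (3795 and 10734), the upper fence would be about 21,143 steps. The data frame `toy` and its values are invented for illustration, and zero-step days may reflect a device that was not worn rather than a fault.

```r
# Toy stand-in for combined_data (illustrative values, NOT from the dataset)
toy <- data.frame(
  Id = c(1, 1, 1, 2, 2, 2, 3, 3),
  TotalSteps = c(0, 7400, 8100, 3800, 36019, 10700, 0, 5200)
)

# Upper fence: third quartile plus 1.5 times the interquartile range
q <- quantile(toy$TotalSteps, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Flag days with no steps at all and days above the upper fence
toy$ZeroSteps <- toy$TotalSteps == 0
toy$HighSteps <- toy$TotalSteps > upper_fence

# Rows worth a closer look before they enter any averages or plots
suspect <- toy[toy$ZeroSteps | toy$HighSteps, ]
print(suspect)
```

Flagging rather than deleting keeps the decision visible: the flagged rows can be excluded from summaries or passed to the device team for investigation.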
In conclusion, the company may use the information and recommendations provided above to identify and target potential customers.