About the company: Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products.

Product: Bellabeat app - The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The app connects to Bellabeat's line of smart wellness products.

BUSINESS PROBLEM: Analyze smart device usage data in order to gain insight into how consumers use these smart devices.

ASK PHASE: This project investigates trends in smart device usage, how those trends apply to Bellabeat customers, and how they could help shape Bellabeat's marketing strategy.

Stakeholders: Urška Sršen - Cofounder & Chief Creative Officer of Bellabeat

PREPARE PHASE: The data is available on the Kaggle website. It is stored as CSV files in long format. There are concerns about bias and credibility. On one hand, the data gives no information about the demographics or working lifestyle of the respondents, which makes it difficult to base concrete decisions on the analysis. On the other hand, some of the datasets lack detailed metadata or clearly defined headings, which makes them unreliable and limits their usability.

Regarding licensing, privacy, security, and accessibility: the data is licensed for use through Kaggle, and no personal information about the participants is revealed in the dataset, which protects the privacy of the users involved.

The data's integrity was verified in R. This included, but was not limited to, validating and cross-checking the columns and structure of each dataset before the main analysis. This verification supports the business questions by ensuring that only credible datasets were analyzed to inform data-driven business decisions.

PROCESS PHASE: For this analysis, only RStudio was used. R has the functions needed to cross-validate and analyze the given datasets. It makes it possible to check for and remove errors prior to analysis, and to document the cleaning process. It also supports producing clear data visuals for the stakeholders and other viewers, and sharing the analyzed results.

ANALYZE PHASE: Whenever an analysis produces a notable insight, it is highlighted beneath that analysis, followed by the related business insights and recommendations. This helps to ensure that stakeholders get step-by-step business insight at every stage of the analysis without missing any important information. Note: Feel free to read the business insights provided below each analysis as needed.

SHARE PHASE: Several tables and visuals, including bar charts and line charts, were created in the course of the analysis. These give the stakeholders brief yet detailed information to help them understand the business recommendations provided.

The datasets were downloaded from Kaggle webpage via this link: https://www.kaggle.com/datasets/arashnic/fitbit

There are 18 CSV files in the package. After examining them, only two were used for this analysis: dailyActivity_merged.csv and sleepDay_merged.csv. Both datasets were later merged for further analysis.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)
library(dplyr)
library(knitr)
library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.0.3
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(rmarkdown)
library(janitor)
## Warning: package 'janitor' was built under R version 4.0.5
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

For this project, two datasets were finally selected: 1. sleepDay_merged 2. dailyActivity_merged

Table 1: sleepDay_merged

Importing the dataset "sleepDay_merged.csv" and previewing the first few rows

daily_sleep <- read_csv("C:\\Users\\personal\\Documents\\BellaBeat\\sleepDay_merged.csv")
## Rows: 413 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(daily_sleep)
## # A tibble: 6 x 5
##           Id SleepDay           TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
##        <dbl> <chr>                         <dbl>            <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:00:0~                1              327            346
## 2 1503960366 4/13/2016 12:00:0~                2              384            407
## 3 1503960366 4/15/2016 12:00:0~                1              412            442
## 4 1503960366 4/16/2016 12:00:0~                2              340            367
## 5 1503960366 4/17/2016 12:00:0~                1              700            712
## 6 1503960366 4/19/2016 12:00:0~                1              304            320
str(daily_sleep)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

In order to analyze summary statistics for daily_sleep data, further investigations were made into the dataset.

nrow(daily_sleep)
## [1] 413
n_distinct(daily_sleep$Id)
## [1] 24
sum(is.na(daily_sleep))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
str(daily_sleep)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

There are 24 distinct users (Id values) in the daily_sleep dataset, recorded across 413 rows. There are no missing values, but 3 rows are exact duplicates.
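The duplicate check above reports 3 duplicated rows, and none of the steps shown here removes them. A minimal sketch of one way to drop exact duplicates follows. It uses a small made-up table, so `toy_sleep` and its values are illustrative assumptions rather than rows from the dataset.

```r
library(dplyr)

# Toy table with column names borrowed from daily_sleep; row 2 repeats row 1.
toy_sleep <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 327, 384)
)

n_duplicates <- sum(duplicated(toy_sleep))  # count exact duplicate rows
toy_sleep_unique <- distinct(toy_sleep)     # keep one copy of each row

n_duplicates
nrow(toy_sleep_unique)
```

Applying the same `distinct()` call to daily_sleep before any merge would be one option for keeping a single row per user and day.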

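Cleaning daily_sleep dataset: previewing the column names in snake_case with clean_names(). The result is not assigned, so the later steps keep the original column names.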

clean_names(daily_sleep)
## # A tibble: 413 x 5
##            id sleep_day       total_sleep_rec~ total_minutes_a~ total_time_in_b~
##         <dbl> <chr>                      <dbl>            <dbl>            <dbl>
##  1 1503960366 4/12/2016 12:0~                1              327              346
##  2 1503960366 4/13/2016 12:0~                2              384              407
##  3 1503960366 4/15/2016 12:0~                1              412              442
##  4 1503960366 4/16/2016 12:0~                2              340              367
##  5 1503960366 4/17/2016 12:0~                1              700              712
##  6 1503960366 4/19/2016 12:0~                1              304              320
##  7 1503960366 4/20/2016 12:0~                1              360              377
##  8 1503960366 4/21/2016 12:0~                1              325              364
##  9 1503960366 4/23/2016 12:0~                1              361              384
## 10 1503960366 4/24/2016 12:0~                1              430              449
## # ... with 403 more rows

Converting the date format in the daily_sleep data

daily_sleep <- daily_sleep %>%
  rename(Date = SleepDay) %>%
  # as_date() ignores a tz argument, so only the format is supplied
  mutate(Date = as_date(Date, format = "%m/%d/%Y %I:%M:%S %p"))

Cross-checking that the Date column in the daily_sleep table has been correctly updated

head(daily_sleep)
## # A tibble: 6 x 5
##           Id Date       TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <date>                 <dbl>              <dbl>          <dbl>
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
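As an alternative illustration of the same conversion, the sketch below parses two timestamp strings in the SleepDay format with lubridate's `mdy_hms()` and then keeps only the date. The vector `raw` is a hypothetical stand-in for the column, and this is not the method used in the analysis above.

```r
library(lubridate)

# Timestamp strings in the same month/day/year 12-hour format as SleepDay.
raw <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# mdy_hms() parses the full timestamp; as_date() then drops the time of day.
parsed <- as_date(mdy_hms(raw))

class(parsed)
format(parsed, "%Y-%m-%d")
```

Since every SleepDay value shown above carries the same midnight time, dropping the time component loses no information in those rows.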

Summarizing statistics for daily_sleep data

daily_sleep %>%  
  select(TotalSleepRecords,
  TotalMinutesAsleep,
  TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

The mean TotalMinutesAsleep (419.5 minutes) shows that users sleep about 7 hours per day on average, which is in line with the recommended amount of sleep for an adult. Link here: https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html
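The arithmetic behind the 7-hour figure can be checked directly from the summary output. The sketch below also computes the ratio of the two means as a rough indication of how much of the time in bed is spent asleep. That ratio is an added illustration, not a figure reported in the analysis, and a ratio of means is not the same as a per-night average.

```r
# Means copied from the summary() output above, in minutes.
mean_minutes_asleep <- 419.5
mean_minutes_in_bed <- 458.6

mean_hours_asleep <- mean_minutes_asleep / 60  # minutes to hours
share_of_bed_time_asleep <- mean_minutes_asleep / mean_minutes_in_bed

round(mean_hours_asleep, 2)
round(share_of_bed_time_asleep, 2)
```

This gives about 6.99 hours asleep, and roughly 0.91 of the mean time in bed.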

Business Insight: The app might be updated to show users their daily hours of sleep, while reminding them of the recommended hours of sleep for a healthy lifestyle.

Plotting a graph for the relationship between TotalMinutesAsleep and TotalTimeInBed.

ggplot(data = daily_sleep, aes(x = TotalMinutesAsleep, y = TotalTimeInBed)) +
  geom_point(aes(color = Date)) +
  geom_smooth(color = "purple") +
  labs(title = "TotalMinutesAsleep Vs TotalTimeInBed")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

cor(daily_sleep$TotalMinutesAsleep, daily_sleep$TotalTimeInBed)
## [1] 0.9304575

There is a very strong positive relationship (r = 0.93) between TotalMinutesAsleep and TotalTimeInBed.

Also, TotalMinutesAsleep is less than TotalTimeInBed. This is expected, since users usually spend some time in bed before falling asleep, and it suggests the device records sleep reasonably accurately.
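To make the correlation figure less of a black box, the sketch below computes Pearson's r by hand on the six rows shown in the earlier head() output and compares it with `cor()`. Because it uses only six rows, the value will differ from the 0.93 reported for the full dataset.

```r
# First six rows of daily_sleep, copied from the head() output above.
asleep <- c(327, 384, 412, 340, 700, 304)
in_bed <- c(346, 407, 442, 367, 712, 320)

# Pearson's r: covariance divided by the product of the standard deviations.
r_manual <- cov(asleep, in_bed) / (sd(asleep) * sd(in_bed))
r_builtin <- cor(asleep, in_bed)

all.equal(r_manual, r_builtin)  # the two calculations should agree
all(asleep < in_bed)            # minutes asleep are below time in bed on these rows
```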

The number of people who reported their daily sleep information (24 of the 33 users in the activity data) is not enough for a reliably informative statistical analysis.

Business Insight: Persuading more people to record this information would provide more detail about the relationship between sleep time and healthy living. The company might show customers how recording this information leads to better use of the device.

Table 2: dailyActivity_merged

Exploring and previewing the first few rows of the dataset

daily_activity <- read_csv("C:\\Users\\personal\\Documents\\BellaBeat\\dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(daily_activity)
## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
str(daily_activity)
## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Analyzing Summary Statistics for daily_activity data

nrow(daily_activity) #number of rows in the daily_activity dataset
## [1] 940
n_distinct(daily_activity$Id)
## [1] 33
sum(is.na(daily_activity))
## [1] 0
sum(duplicated(daily_activity))
## [1] 0

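Cleaning daily_activity dataset: previewing the column names in snake_case with clean_names(). The result is not assigned, so the later steps keep the original column names.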

clean_names(daily_activity)
## # A tibble: 940 x 15
##            id activity_date total_steps total_distance tracker_distance
##         <dbl> <chr>               <dbl>          <dbl>            <dbl>
##  1 1503960366 4/12/2016           13162           8.5              8.5 
##  2 1503960366 4/13/2016           10735           6.97             6.97
##  3 1503960366 4/14/2016           10460           6.74             6.74
##  4 1503960366 4/15/2016            9762           6.28             6.28
##  5 1503960366 4/16/2016           12669           8.16             8.16
##  6 1503960366 4/17/2016            9705           6.48             6.48
##  7 1503960366 4/18/2016           13019           8.59             8.59
##  8 1503960366 4/19/2016           15506           9.88             9.88
##  9 1503960366 4/20/2016           10544           6.68             6.68
## 10 1503960366 4/21/2016            9819           6.34             6.34
## # ... with 930 more rows, and 10 more variables:
## #   logged_activities_distance <dbl>, very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>

The date column is stored as a string in m/d/yyyy format, so it needs to be converted to a Date type, which R displays as yyyy-mm-dd.

daily_activity <- daily_activity %>%
  rename(Date = ActivityDate) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y"))

Cross-checking that the Date column in the daily_activity table has been correctly updated

head(daily_activity)
## # A tibble: 6 x 15
##          Id Date       TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##       <dbl> <date>          <dbl>         <dbl>           <dbl>            <dbl>
## 1    1.50e9 2016-04-12      13162          8.5             8.5                 0
## 2    1.50e9 2016-04-13      10735          6.97            6.97                0
## 3    1.50e9 2016-04-14      10460          6.74            6.74                0
## 4    1.50e9 2016-04-15       9762          6.28            6.28                0
## 5    1.50e9 2016-04-16      12669          8.16            8.16                0
## 6    1.50e9 2016-04-17       9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

Summarizing statistics for daily_activity dataset

daily_activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

Exploring the relationship between TotalSteps and Calories

ggplot(data = daily_activity, aes(x = TotalSteps, y = Calories)) +
  geom_smooth(color = "green") +
  geom_point(color = "blue") +
  labs(title = "TotalSteps Vs Calories")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

cor(daily_activity$TotalSteps, daily_activity$Calories)
## [1] 0.5915681

There is a moderate positive relationship (r = 0.59) between TotalSteps and Calories burnt. There are outliers, with one record of more than 30,000 steps and a set of records with 0 steps in a day; these might be glitches recorded by the device. Overall, users recorded close to 8,000 steps per day on average (mean 7,638, median 7,406).

Business Insight: The company might update the app to tell users which category they fall into based on the number of steps taken. An example of such a message is: “A particular percentage (%) of users take about 8,000 steps per day. You might want to do better today by going for a walk to burn a reasonable amount of calories.” This could be targeted at users who do not walk daily but use the app for other purposes.
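A rough sketch of how the percentage in such a message could be computed is shown below. The `step_goal` threshold and the message wording are hypothetical, and the step counts are just the six rows from the earlier head() output, so the resulting figure does not describe the full dataset.

```r
# Daily step counts copied from the first six rows of daily_activity above.
steps <- c(13162, 10735, 10460, 9762, 12669, 9705)

# Hypothetical threshold for the in-app message; not defined in the source.
step_goal <- 10000L

# Proportion of recorded days at or above the goal.
share_meeting_goal <- mean(steps >= step_goal)

message_text <- sprintf("%.0f%% of recorded days reached %d steps.",
                        100 * share_meeting_goal, step_goal)
message_text
```

The same calculation could be grouped by user Id to place each user in a category, if the app needs per-user messages.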

Exploring the relationship between TotalSteps and SedentaryMinutes

ggplot(data = daily_activity, aes(x = TotalSteps, y = SedentaryMinutes)) +
  geom_smooth(color = "green") +
  geom_point(color = "blue") +
  labs(title = "TotalSteps Vs SedentaryMinutes")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

cor(daily_activity$TotalSteps, daily_activity$SedentaryMinutes)
## [1] -0.3274835

There is a weak-to-moderate negative correlation (r = -0.33) between TotalSteps and SedentaryMinutes. This is expected, as an increase in the number of steps taken means a decrease in the time spent in one spot. The relatively high sedentary minutes recorded (a mean of 991.2 minutes per day) also show that most users spend much of the day in one spot, while fewer move around regularly.

Business Insight: The smart device may be updated to target users who record relatively high sedentary minutes, reminding them to take more steps in order to keep fit and burn calories. This can be another marketing strategy to win more potential customers.

Exploring the relationship between TotalSteps and TotalDistance

ggplot(data = daily_activity, aes(x = TotalSteps, y = TotalDistance)) +
  geom_smooth(color = "green") +
  geom_point(color = "blue") +
  labs(title = "TotalDistance vs TotalSteps")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

cor(daily_activity$TotalDistance, daily_activity$TotalSteps)
## [1] 0.9853688

There is a very strong positive correlation (r = 0.99) between TotalDistance and TotalSteps. This is further evidence of the smart device's accuracy, since the total distance covered by a user is expected to be closely correlated with the number of steps they take.

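Combining daily_sleep and daily_activity to get a more insightful dataset and analysis.

With all = TRUE, merge() performs a full outer join, so every row from both datasets is kept. The result has 943 rows rather than 940, which is consistent with the 3 duplicated rows in daily_sleep being carried into the merge.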

combined_data <- merge(daily_activity, daily_sleep, by = c("Id", "Date"), all = TRUE)
glimpse(combined_data)
## Rows: 943
## Columns: 18
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date                     <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
## $ TotalSleepRecords        <dbl> 1, 2, NA, 1, 2, 1, NA, 1, 1, 1, NA, 1, 1, 1, ~
## $ TotalMinutesAsleep       <dbl> 327, 384, NA, 412, 340, 700, NA, 304, 360, 32~
## $ TotalTimeInBed           <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377, 36~
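How duplicated rows on one side of a full merge affect the row count can be seen on a small example. The two toy tables below are assumptions made for illustration, with column names borrowed from the datasets above.

```r
# Toy activity table: one row per Id and Date.
activity <- data.frame(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 9762)
)

# Toy sleep table: Id 1 on 2016-04-12 appears twice; other days have no record.
sleep <- data.frame(
  Id = c(1, 1),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 327)
)

full <- merge(activity, sleep, by = c("Id", "Date"), all = TRUE)
full_dedup <- merge(activity, unique(sleep), by = c("Id", "Date"), all = TRUE)

nrow(full)                           # 4: the duplicated sleep row is matched twice
nrow(full_dedup)                     # 3: one row per activity record
sum(is.na(full$TotalMinutesAsleep))  # 2: activity rows with no sleep record
```

Activity rows without a matching sleep record get NA in the sleep columns, which is where the NA values in the glimpse() output come from.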

Counting the distinct users (Id values), and replacing the NAs with 0s.

n_distinct(combined_data$Id)
## [1] 33
sum(is.na(combined_data))
## [1] 1590
combined_data <- combined_data %>%
  mutate_if(is.numeric, ~replace(., is.na(.), 0))
sum(is.na(combined_data))
## [1] 0
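Replacing missing values with 0 means that days without a sleep record count as 0 minutes in any later average. The sketch below compares that with skipping the missing days via `na.rm = TRUE`. The vector is made up for illustration, and which approach fits depends on the question being asked.

```r
# Toy sleep minutes: NA marks a day with no sleep record.
minutes_asleep <- c(327, 384, NA, 412)

# Average over the three recorded days only.
mean_ignoring_na <- mean(minutes_asleep, na.rm = TRUE)

# Average after treating the missing day as 0 minutes of sleep.
zero_filled <- replace(minutes_asleep, is.na(minutes_asleep), 0)
mean_zero_filled <- mean(zero_filled)

mean_ignoring_na
mean_zero_filled
```

On this toy vector the zero-filled mean is 280.75 minutes, noticeably lower than the mean of the recorded days.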

Converting the days of the week to numbers from 0 to 6, where Sun = 0, Mon = 1, and so on up to Sat = 6

format(as.Date(combined_data$Date),"%w")
##   [1] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
##  [19] "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6"
##  [37] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
##  [55] "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
##  [73] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1"
##  [91] "2" "3" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [109] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4"
## [127] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1"
## [145] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2"
## [163] "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6"
## [181] "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [199] "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [217] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
## [235] "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6"
## [253] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "2" "3" "4" "5" "6"
## [271] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [289] "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [307] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "2" "3" "4" "5" "6" "0" "1" "2"
## [325] "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6"
## [343] "0" "1" "2" "3" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1"
## [361] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2"
## [379] "3" "4" "5" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2"
## [397] "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3"
## [415] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [433] "1" "2" "3" "4" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0"
## [451] "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [469] "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
## [487] "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2"
## [505] "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [523] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "6" "0" "1" "2" "3" "4" "2" "3"
## [541] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [559] "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1"
## [577] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
## [595] "6" "0" "1" "2" "3" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [613] "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [631] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1"
## [649] "2" "3" "4" "5" "6" "0" "1" "2" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [667] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "2" "3"
## [685] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0"
## [703] "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1"
## [721] "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
## [739] "6" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [757] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5"
## [775] "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2"
## [793] "3" "4" "5" "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [811] "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "2" "3" "4" "5" "6" "0" "1" "2"
## [829] "3" "4" "5" "6" "0" "1" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5"
## [847] "6" "0" "1" "2" "3" "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6"
## [865] "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3"
## [883] "4" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [901] "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "2" "3" "4" "5" "6" "0"
## [919] "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4" "5" "6" "0" "1" "2" "3" "4"
## [937] "5" "6" "0" "1" "2" "3" "4"

Storing the day-of-week numbers in a new column named “DayOfTheWeek”, and previewing the corresponding day names with wday()

combined_data$DayOfTheWeek <- format(as.Date(combined_data$Date),"%w")
wday(combined_data$Date, label=TRUE)
##   [1] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri
##  [19] Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat
##  [37] Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed
##  [55] Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
##  [73] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon
##  [91] Tue Wed Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed
## [109] Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu
## [127] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon
## [145] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun Mon Tue
## [163] Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat
## [181] Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun
## [199] Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [217] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri
## [235] Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat
## [253] Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Tue Wed Thu Fri Sat
## [271] Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed
## [289] Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [307] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Tue Wed Thu Fri Sat Sun Mon Tue
## [325] Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat
## [343] Sun Mon Tue Wed Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon
## [361] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue
## [379] Wed Thu Fri Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue
## [397] Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed
## [415] Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun
## [433] Mon Tue Wed Thu Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun
## [451] Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [469] Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri
## [487] Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue
## [505] Wed Thu Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed
## [523] Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sat Sun Mon Tue Wed Thu Tue Wed
## [541] Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun
## [559] Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun Mon
## [577] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri
## [595] Sat Sun Mon Tue Wed Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun
## [613] Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [631] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon
## [649] Tue Wed Thu Fri Sat Sun Mon Tue Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [667] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Tue Wed
## [685] Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun
## [703] Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun Mon
## [721] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri
## [739] Sat Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [757] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri
## [775] Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue
## [793] Wed Thu Fri Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun Mon Tue Wed
## [811] Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Tue Wed Thu Fri Sat Sun Mon Tue
## [829] Wed Thu Fri Sat Sun Mon Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri
## [847] Sat Sun Mon Tue Wed Thu Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat
## [865] Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed
## [883] Thu Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [901] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Tue Wed Thu Fri Sat Sun
## [919] Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [937] Fri Sat Sun Mon Tue Wed Thu
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
str(combined_data)
## 'data.frame':    943 obs. of  19 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ Date                    : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : num  13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num  728 776 1218 726 773 ...
##  $ Calories                : num  1985 1797 1776 1745 1863 ...
##  $ TotalSleepRecords       : num  1 2 0 1 2 1 0 1 1 1 ...
##  $ TotalMinutesAsleep      : num  327 384 0 412 340 700 0 304 360 325 ...
##  $ TotalTimeInBed          : num  346 407 0 442 367 712 0 320 377 364 ...
##  $ DayOfTheWeek            : chr  "2" "3" "4" "5" ...

Combining the four activity-minute columns into a single TotalMinutes column.

combined_data$TotalMinutes = combined_data$VeryActiveMinutes +
  combined_data$FairlyActiveMinutes +
  combined_data$LightlyActiveMinutes +
  combined_data$SedentaryMinutes
str(combined_data)
## 'data.frame':    943 obs. of  20 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ Date                    : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : num  13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num  728 776 1218 726 773 ...
##  $ Calories                : num  1985 1797 1776 1745 1863 ...
##  $ TotalSleepRecords       : num  1 2 0 1 2 1 0 1 1 1 ...
##  $ TotalMinutesAsleep      : num  327 384 0 412 340 700 0 304 360 325 ...
##  $ TotalTimeInBed          : num  346 407 0 442 367 712 0 320 377 364 ...
##  $ DayOfTheWeek            : chr  "2" "3" "4" "5" ...
##  $ TotalMinutes            : num  1094 1033 1440 998 1040 ...
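
The same sum can be written with rowSums(), which also makes it easy to see which days fall short of a full 1,440-minute day. The following is a minimal sketch on a toy data frame, not the code used in this analysis: `toy`, `add_total_minutes`, and the `FullDay` column are hypothetical names, and the three rows are copied from the str() output above.

```r
# Hypothetical helper: add TotalMinutes as the row-wise sum of the four minute columns
add_total_minutes <- function(df) {
  minute_cols <- c("VeryActiveMinutes", "FairlyActiveMinutes",
                   "LightlyActiveMinutes", "SedentaryMinutes")
  df$TotalMinutes <- rowSums(df[, minute_cols])
  df
}

# Toy rows: the first three records shown in the str() output above
toy <- data.frame(
  VeryActiveMinutes    = c(25, 21, 30),
  FairlyActiveMinutes  = c(13, 19, 11),
  LightlyActiveMinutes = c(328, 217, 181),
  SedentaryMinutes     = c(728, 776, 1218)
)

toy <- add_total_minutes(toy)

# Flag days on which the four categories account for all 1,440 minutes
toy$FullDay <- toy$TotalMinutes == 1440
print(toy[, c("TotalMinutes", "FullDay")])
```

Days with fewer than 1,440 recorded minutes may be days on which the device was not worn the whole time, so the flag can help decide which records to treat with caution.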

Checking how often users logged activity on each day of the week.

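# Derive the weekday name from Date (%A is locale-dependent; English names are assumed here)
combined_data$DayOfTheWeek = strftime(combined_data$Date, '%A')
# Order the levels Monday to Sunday so the bars plot in calendar order
combined_data$DayOfTheWeek = factor(combined_data$DayOfTheWeek, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))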

Plotting a bar chart of the number of records logged on each day of the week.

ggplot(data=combined_data,aes(x=DayOfTheWeek, fill=DayOfTheWeek)) + geom_bar(stat = "count") +
  theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold")) + 
  labs(x = 'Day of Week',
       y = 'Frequency of logged in Times',
       title = 'Number of times users logged in app across the week')

The app recorded more entries on Tuesdays, Wednesdays, and Thursdays than from Friday through Monday. Part of this gap may be an artifact of the tracking window: the records shown above start on a Tuesday (2016-04-12) and most of the printed weekday sequences end on a Thursday, so those three days may simply occur once more than the others. Beyond that, the dataset lacks information such as the users' job types and age groups, so the report cannot fully explain why users might log less activity or take fewer steps on certain days of the week.

It can only be assumed that users stay indoors during weekends and take more steps at work on weekdays. More detailed demographic information about the users, and data from more users, would provide additional insight for this kind of analysis.

Alternatively, users may drive to places on weekends when they are not at work. If that is the case, then: Business Insight: Users could be reminded to take a walk on weekends rather than driving or staying in one spot for longer than usual.
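
Before reading much into weekday differences, it may help to check how many times each weekday occurs in the tracking window, since a day that occurs more often will collect more records. The sketch below uses base R only. The start date is the first Date shown in the str() output; the end date of 2016-05-12 is an assumption and should be replaced with max(combined_data$Date).

```r
# Assumption: the window runs from 2016-04-12 (first Date shown above) to 2016-05-12.
# The end date is assumed here, not taken from the output.
window <- seq(as.Date("2016-04-12"), as.Date("2016-05-12"), by = "day")

# %u gives the ISO weekday number (1 = Monday ... 7 = Sunday), independent of locale
day_num <- as.integer(format(window, "%u"))

# Count how often each weekday occurs in the window
day_counts <- table(factor(day_num, levels = 1:7,
                           labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
print(day_counts)
```

Under that assumed window, Tuesday, Wednesday, and Thursday each occur five times and the other days four times. Dividing the record count for each weekday by its number of occurrences would give a fairer comparison across days.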

Investigating the combined data

combined_data %>%  
  select(TotalSteps,
  TotalDistance,
  TotalMinutes,
  SedentaryMinutes,
  Calories) %>%
  summary()
##    TotalSteps    TotalDistance     TotalMinutes    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   2.0   Min.   :   0.0  
##  1st Qu.: 3795   1st Qu.: 2.620   1st Qu.: 989.5   1st Qu.: 729.0  
##  Median : 7439   Median : 5.260   Median :1440.0   Median :1057.0  
##  Mean   : 7652   Mean   : 5.503   Mean   :1218.2   Mean   : 990.4  
##  3rd Qu.:10734   3rd Qu.: 7.720   3rd Qu.:1440.0   3rd Qu.:1229.0  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :1440.0  
##     Calories   
##  Min.   :   0  
##  1st Qu.:1830  
##  Median :2140  
##  Mean   :2308  
##  3rd Qu.:2796  
##  Max.   :4900

Investigating the relationship between TotalDistance and Calories, and between TotalSteps and Calories.

g1= ggplot(data=combined_data)+
      geom_point(mapping = aes(x = TotalDistance, y =Calories),color="green")+
      labs(title="Total Distance Vs Calories")
cor(combined_data$TotalDistance, combined_data$Calories)
## [1] 0.6466023
g2= ggplot(data=combined_data)+
      geom_point(mapping = aes(x = TotalSteps, y =Calories),color="blue")+
      labs(title="Total Steps Vs Calories")
cor(combined_data$TotalSteps, combined_data$Calories)
## [1] 0.5929493
ggarrange(g1, g2, ncol = 2, nrow = 1)

There is a moderately strong positive correlation (0.65) between TotalDistance and Calories, and a similar one (0.59) between TotalSteps and Calories.

As reported earlier, TotalSteps and TotalDistance are strongly correlated, which explains why the two graphs look so similar. Both graphs suggest that the more steps taken or distance covered, the more calories an individual tends to burn, although a correlation alone does not establish cause.
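
To put the two coefficients in perspective, squaring a Pearson correlation gives the share of variance in one variable that a straight-line fit on the other would account for. The sketch below is illustrative only: `toy` holds the first five records from the str() output above, so its correlations will not match the full-data values, and `r_steps` and `r_distance` are simply the full-data coefficients reported above.

```r
# Toy rows: the first five records shown in the str() output above
toy <- data.frame(
  TotalSteps    = c(13162, 10735, 10460, 9762, 12669),
  TotalDistance = c(8.50, 6.97, 6.74, 6.28, 8.16),
  Calories      = c(1985, 1797, 1776, 1745, 1863)
)

# Pairwise Pearson correlations (the default method of cor())
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# Full-data coefficients reported above, squared to get the share of variance explained
r_distance <- 0.6466023
r_steps    <- 0.5929493
r_squared  <- c(distance = r_distance^2, steps = r_steps^2)
print(round(r_squared, 2))
```

On the full data, distance accounts for roughly 42% and steps for roughly 35% of the variation in Calories, so most of the variation is left to other factors that this dataset does not describe.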

Business Insight: This information can be surfaced in the device to remind users that they can stay fit by taking more steps on a daily basis.

Detailed scatter plot of calories burned against steps taken, with reference lines at the mean of each variable.

ggplot(data = combined_data) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories, color = TotalSteps)) +
  scale_color_gradientn(colours = terrain.colors(12)) +
  # Reference lines at the means reported by summary() above
  geom_hline(yintercept = 2308, color = "purple", size = 0.5) +
  geom_vline(xintercept = 7652, color = "black", size = 0.5) +
  # annotate() draws the label once; geom_text(aes(...)) would redraw it for every row
  annotate("text", x = 9500, y = 2200, label = "Mean", color = "black", size = 5) +
  theme(plot.title = element_text(hjust = 0.2, lineheight = 0.5, face = "bold")) +
  labs(
    x = 'Steps taken',
    y = 'Calories burned',
    title = 'Calories burnt per step taken')

cor(combined_data$TotalSteps, combined_data$Calories)
## [1] 0.5929493
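
The reference lines in the plot are placed at the means reported by summary(). One way to avoid typing those numbers by hand is to compute them from the data. This is a minimal sketch: `reference_means` and `toy` are hypothetical names, and the toy rows are the first five records from the str() output, so their means differ from the full-data values of 7652 and 2308.

```r
# Toy rows: the first five records shown in the str() output above
toy <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762, 12669),
  Calories   = c(1985, 1797, 1776, 1745, 1863)
)

# Hypothetical helper: column means to use as reference-line positions
reference_means <- function(df) {
  c(steps    = mean(df$TotalSteps, na.rm = TRUE),
    calories = mean(df$Calories, na.rm = TRUE))
}

ref <- reference_means(toy)
print(ref)

# In the plot, these values would replace the literals, for example:
#   geom_hline(yintercept = ref[["calories"]]) and geom_vline(xintercept = ref[["steps"]])
```

Computing the positions this way keeps the lines correct if the data is filtered or refreshed later.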

Revised Business Recommendations: To improve the marketing strategy for the Bellabeat app, the company could update the information it provides to users on a daily or weekly basis, drawing on the recommendations above.

One of these recommendations is to inform users and potential customers that they can track their lifestyles on a daily or weekly basis by using all the resources available in this smart device. Detailed insight into their daily habits could help them focus on the adjustments needed for healthy living. These resources include monitoring stress, menstrual cycle, daily steps taken, etc. Encouraging users to log in and use every resource of the device for record keeping will enable the company to give users more information about their daily lifestyle.

The company may also need to investigate the device to address any glitch that produces strange reports or outliers, such as days with 0 or more than 30,000 TotalSteps.
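
Such records can be flagged for review before deciding whether they are device errors. The sketch below is one possible approach, not part of the original analysis: `flag_step_outliers`, the `StepFlag` column, and the toy rows are hypothetical, and the 30,000-step cutoff is taken from the text above rather than from any established standard.

```r
# Hypothetical helper: label each daily record as "zero", "very_high", or "ok"
flag_step_outliers <- function(df, upper = 30000) {
  df$StepFlag <- ifelse(df$TotalSteps == 0, "zero",
                        ifelse(df$TotalSteps > upper, "very_high", "ok"))
  df
}

# Toy records (Ids are invented; step values echo the min, median, max, and
# third quartile from summary() above)
toy <- data.frame(
  Id         = c(1, 1, 2, 2),
  TotalSteps = c(0, 7439, 36019, 10734)
)

flagged <- flag_step_outliers(toy)
print(table(flagged$StepFlag))
```

A zero-step day may simply mean the device was not worn, so flagged rows are candidates for inspection rather than proof of a fault.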

In conclusion, the company may use the information/recommendations provided above to determine/target potential customers.