About the Company Bellabeat:

Bellabeat is a high-tech company founded by Urška Sršen and Sandor Mur. All of its products are designed to improve people’s health. Sršen drew on her background as an artist to develop innovative tools that give women access to information and inspiration on a global scale. By tracking movement, rest, emotional state, and reproductive health, Bellabeat products give women the tools they need to take charge of their own health and well-being. Since its founding in 2013, Bellabeat, a tech-driven wellness company for women, has grown rapidly.

Products of the company:

Ivy: A one-of-a-kind health and wellness tracker designed by women specifically for women. Ivy is a stylish bracelet that analyzes your physiological data and your physical and mental activity to suggest what you need to do to improve your self-care routines and achieve peak performance. This product was only recently released, and its data are not included in our analysis.

Bellabeat app: Provides users with health-related information on their activity, sleep, stress, menstrual cycle, and mindfulness practices. This data can help users better understand their existing habits and make healthy choices. The Bellabeat app connects to the company’s line of smart wellness products.

Leaf: The Leaf wellness tracker by Bellabeat can be worn as a bracelet, necklace, or clip. Leaf connects to the Bellabeat app to monitor exercise, sleep, and stress.

Time: This health watch blends the timeless design of a classic timepiece with advanced technology to monitor the wearer’s activity, sleep, and stress levels. The Time watch connects to the Bellabeat app to deliver daily wellness information.

Spring: This is a smart water bottle that monitors daily water intake to ensure that you remain well hydrated throughout the day. The Spring bottle is integrated with the Bellabeat app to monitor hydration levels.

Bellabeat membership: Bellabeat also offers consumers a subscription-based membership program. Membership gives customers access, 24 hours a day and seven days a week, to fully customized advice on diet, fitness, sleep, health and beauty, and mindfulness, based on their lifestyle and objectives.

ASK

Specifically, I’ll focus on Bellabeat’s Leaf and Time products and use smart-device usage data to learn more about how people use wearables such as smart watches.

Business Task:

Sršen has asked for smart-device usage data to be analyzed in order to gain insight into how consumers use smart devices that are not Bellabeat products.

Business Objective:

How has the use of smart devices evolved recently?
Wearable fitness technology, including gadgets like Fitbits and smartwatches, has established itself as a viable niche in the healthcare market. Consumers’ interest in tracking their own health and vital signs has led to a tripling in the adoption of wearable devices over the past four years. Wearables are predicted to remain popular over the next few years as more people become open to sharing their health data with healthcare professionals and insurers. Insider Intelligence predicted in October 2021 that the US smart wearable user market would expand 25.5% YoY in 2023, up from 23.3% YoY growth in 2021 (source: SmartDevices Evolution).

Business Deliverables:

Find the key differences between Fitbit users and Bellabeat users and how digital media and other factors could influence them.

Stakeholders:

- Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer.
- Sandor Mur: Mathematician and Bellabeat co-founder; key member of the Bellabeat executive team.
- Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

PREPARE

Data Source:

The data is located in a Kaggle dataset

CC0: Public Domain

This dataset is made available through Mobius

Data Description:

This dataset is hosted on Kaggle and was made public by the user Mobius under a public-domain (CC0) license. The data can therefore be copied, modified, and distributed without asking the user for permission. According to its description, the dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between March 12 and May 12, 2016. Thirty eligible Fitbit users reportedly (see the data limitations section immediately below) agreed to submit personal tracker data, including information about their daily activity (number of steps walked, calories burned, time awake, heart rate, and distance traveled). This information was compiled by the minute, the hour, and the day, and it is spread across eighteen CSV files. After saving all 18 files to my laptop, I decided to use just 3 of them because together they contain the daily activity, sleep, and weight log information. For security purposes, I deleted the rest of the files. The 3 files used for further analysis are:
sleepDay_merged.csv
dailyActivity_merged.csv
weightLogInfo_merged.csv

Data Limitations:

Confirming the ROCCC process:
Reliable: No. The sample represents very few people (33), and a sample this small increases the likelihood of statistical error.
Original: No. The dataset was collected by a third party through Amazon Mechanical Turk.
Comprehensive: Partly. The information is highly relevant to the sleep and activity features of the Bellabeat Leaf product but does not cover any other features.
Current: No. The data was collected in 2016, so it is 7 years old and may no longer be relevant.
Cited: Yes. The source is referenced, and the information was gathered without revealing any personal details.
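
To make the reliability concern concrete, here is a minimal sketch of how the standard error of a mean shrinks as the sample grows. The standard deviation of 4000 steps and the comparison sample of 1000 people are made-up values for illustration only; neither is measured from this dataset.

```r
# Standard error of a sample mean: sd / sqrt(n).
# NOTE: assumed_sd and n_large are hypothetical values chosen for illustration.
standard_error <- function(sd, n) sd / sqrt(n)

assumed_sd <- 4000   # hypothetical spread of daily steps between people
n_small <- 33        # participants in this dataset
n_large <- 1000      # hypothetical larger sample

se_small <- standard_error(assumed_sd, n_small)
se_large <- standard_error(assumed_sd, n_large)

# The ratio of the two standard errors depends only on the sample sizes
se_ratio <- se_small / se_large
print(c(se_small = se_small, se_large = se_large, ratio = se_ratio))
```

Whatever the true spread is, the ratio stays the same: an average estimated from 33 people is roughly 5.5 times less precise than one estimated from 1000.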

Aside from the Id and LogId numbers, the collected data contains no personal information, so there are no privacy concerns to address and the participants remain anonymous. That said, I do not know the age, gender, ethnicity, or status of these participants, so I cannot assess sampling bias. Note: Overall, this is not a quality dataset to be used for actual business recommendations.

PROCESS

In this phase I clean and format the data to make it easier to read and understand: adding columns, extracting the relevant information, and eliminating errors and duplicates. To keep things straightforward, I did everything in R. By loading the CSV files as tables and then joining those tables on their common columns, I could handle the full set of files and run the necessary queries.

Upload CSV Files Into R

I uploaded the CSV files to my project from the relevant data sources mentioned above.

Installing and loading common packages and libraries

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
library(ggplot2)
library(tidyr)

Loading the .CSV Files from the Kaggle Dataset provided by the Case Study

Here I load three CSV files (dailyActivity_merged.csv, sleepDay_merged.csv, and weightLogInfo_merged.csv) instead of all 18. I chose these because dailyActivity_merged.csv already contains much of what the other tables hold, e.g. the calories, intensity, distance, and steps data recorded on a daily basis. Loading only these 3 datasets avoids duplicating data.

dailyActivity <- read_csv("/cloud/project/Fitabase Data 4.12.16 - 5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
(dailyActivity)
## # A tibble: 940 × 15
##            Id Activity…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸
##         <dbl> <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1503960366 4/12/2016    13162    8.5     8.5        0    1.88   0.550    6.06
##  2 1503960366 4/13/2016    10735    6.97    6.97       0    1.57   0.690    4.71
##  3 1503960366 4/14/2016    10460    6.74    6.74       0    2.44   0.400    3.91
##  4 1503960366 4/15/2016     9762    6.28    6.28       0    2.14   1.26     2.83
##  5 1503960366 4/16/2016    12669    8.16    8.16       0    2.71   0.410    5.04
##  6 1503960366 4/17/2016     9705    6.48    6.48       0    3.19   0.780    2.51
##  7 1503960366 4/18/2016    13019    8.59    8.59       0    3.25   0.640    4.71
##  8 1503960366 4/19/2016    15506    9.88    9.88       0    3.53   1.32     5.03
##  9 1503960366 4/20/2016    10544    6.68    6.68       0    1.96   0.480    4.24
## 10 1503960366 4/21/2016     9819    6.34    6.34       0    1.34   0.350    4.65
## # … with 930 more rows, 6 more variables: SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹​ActivityDate, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​VeryActiveDistance,
## #   ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance
sleepDay <- read_csv("/cloud/project/Fitabase Data 4.12.16 - 5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
(sleepDay)
## # A tibble: 413 × 5
##            Id SleepDay              TotalSleepRecords TotalMinutesAsleep Total…¹
##         <dbl> <chr>                             <dbl>              <dbl>   <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                 1                327     346
##  2 1503960366 4/13/2016 12:00:00 AM                 2                384     407
##  3 1503960366 4/15/2016 12:00:00 AM                 1                412     442
##  4 1503960366 4/16/2016 12:00:00 AM                 2                340     367
##  5 1503960366 4/17/2016 12:00:00 AM                 1                700     712
##  6 1503960366 4/19/2016 12:00:00 AM                 1                304     320
##  7 1503960366 4/20/2016 12:00:00 AM                 1                360     377
##  8 1503960366 4/21/2016 12:00:00 AM                 1                325     364
##  9 1503960366 4/23/2016 12:00:00 AM                 1                361     384
## 10 1503960366 4/24/2016 12:00:00 AM                 1                430     449
## # … with 403 more rows, and abbreviated variable name ¹​TotalTimeInBed
weightLogInfo <- read_csv("/cloud/project/Fitabase Data 4.12.16 - 5.12.16/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
(weightLogInfo)
## # A tibble: 67 × 8
##            Id Date                  WeightKg Weigh…¹   Fat   BMI IsMan…²   LogId
##         <dbl> <chr>                    <dbl>   <dbl> <dbl> <dbl> <lgl>     <dbl>
##  1 1503960366 5/2/2016 11:59:59 PM      52.6    116.    22  22.6 TRUE    1.46e12
##  2 1503960366 5/3/2016 11:59:59 PM      52.6    116.    NA  22.6 TRUE    1.46e12
##  3 1927972279 4/13/2016 1:08:52 AM     134.     294.    NA  47.5 FALSE   1.46e12
##  4 2873212765 4/21/2016 11:59:59 PM     56.7    125.    NA  21.5 TRUE    1.46e12
##  5 2873212765 5/12/2016 11:59:59 PM     57.3    126.    NA  21.7 TRUE    1.46e12
##  6 4319703577 4/17/2016 11:59:59 PM     72.4    160.    25  27.5 TRUE    1.46e12
##  7 4319703577 5/4/2016 11:59:59 PM      72.3    159.    NA  27.4 TRUE    1.46e12
##  8 4558609924 4/18/2016 11:59:59 PM     69.7    154.    NA  27.2 TRUE    1.46e12
##  9 4558609924 4/25/2016 11:59:59 PM     70.3    155.    NA  27.5 TRUE    1.46e12
## 10 4558609924 5/1/2016 11:59:59 PM      69.9    154.    NA  27.3 TRUE    1.46e12
## # … with 57 more rows, and abbreviated variable names ¹​WeightPounds,
## #   ²​IsManualReport
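
Because Id is an identifier rather than a quantity, it can be safer to read it as text than as a number. The following is a minimal sketch of that idea using base R's `read.csv` on an inline two-row sample instead of the real file, so it runs without extra packages. The inline sample and the names `sample_csv` and `activity_sample` are illustrative assumptions, not part of the original analysis.

```r
# Inline stand-in for dailyActivity_merged.csv (two rows copied from the output above)
sample_csv <- "Id,ActivityDate,TotalSteps
1503960366,4/12/2016,13162
1503960366,4/13/2016,10735"

# Read Id and the date as text so nothing is silently coerced
activity_sample <- read.csv(
  text = sample_csv,
  colClasses = c(Id = "character", ActivityDate = "character", TotalSteps = "numeric")
)

# Convert the date column explicitly once the file is loaded
activity_sample$ActivityDate <- as.Date(activity_sample$ActivityDate, format = "%m/%d/%Y")
str(activity_sample)
```

Keeping Id as text prevents it from being displayed in scientific notation or used in arithmetic by accident.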

Clean the data so I can work with it effectively, and check whether each of the 3 datasets has enough participants to analyze

count(distinct(dailyActivity, Id))
## # A tibble: 1 × 1
##       n
##   <int>
## 1    33

The result: dailyActivity contains 33 distinct participants

count(distinct(sleepDay, Id))
## # A tibble: 1 × 1
##       n
##   <int>
## 1    24

Here, sleepDay contains 24 distinct participants

count(distinct(weightLogInfo, Id))
## # A tibble: 1 × 1
##       n
##   <int>
## 1     8

Here, weightLogInfo contains only 8 distinct participants, which is too few to support our analysis
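
As a quick worked example, the three counts can be turned into coverage shares. This sketch assumes that the 33 participants in dailyActivity represent the full participant pool; the vector name `participants` is introduced here for illustration.

```r
# Distinct participants per table, taken from the counts reported above
participants <- c(dailyActivity = 33, sleepDay = 24, weightLogInfo = 8)

# Share of all tracked participants that appear in each table
coverage <- round(participants / participants[["dailyActivity"]], 3)
print(coverage)
```

Under that assumption, about 73% of participants logged sleep at least once, while only about 24% logged weight.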

Based on these counts, I analyze only the dailyActivity_merged.csv and sleepDay_merged.csv datasets. Let’s check the structure of dailyActivity and sleepDay to see whether the datasets are clean and consistent.

str(dailyActivity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(sleepDay)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The ActivityDate and SleepDay columns hold dates, but they were read in as character strings. So I convert both to the Date type.

# Parse month/day/year strings; the trailing time text in SleepDay is ignored
dailyActivity$ActivityDate <- as.Date(dailyActivity$ActivityDate, format = "%m/%d/%Y")
sleepDay$SleepDay <- as.Date(sleepDay$SleepDay, format = "%m/%d/%Y")
head(dailyActivity)
## # A tibble: 6 × 15
##           Id ActivityD…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸
##        <dbl> <date>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1503960366 2016-04-12    13162    8.5     8.5        0    1.88   0.550    6.06
## 2 1503960366 2016-04-13    10735    6.97    6.97       0    1.57   0.690    4.71
## 3 1503960366 2016-04-14    10460    6.74    6.74       0    2.44   0.400    3.91
## 4 1503960366 2016-04-15     9762    6.28    6.28       0    2.14   1.26     2.83
## 5 1503960366 2016-04-16    12669    8.16    8.16       0    2.71   0.410    5.04
## 6 1503960366 2016-04-17     9705    6.48    6.48       0    3.19   0.780    2.51
## # … with 6 more variables: SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹​ActivityDate, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​VeryActiveDistance,
## #   ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance
glimpse(dailyActivity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
head(sleepDay)
## # A tibble: 6 Ɨ 5
##           Id SleepDay   TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <date>                 <dbl>              <dbl>          <dbl>
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
glimpse(sleepDay)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 20…
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
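
When converting text to dates, it helps to confirm that no value failed to parse, because `as.Date()` returns NA for strings that do not match the format rather than raising an error. A minimal sketch on made-up sample strings (the vector `raw_dates` is hypothetical and includes one deliberately bad value):

```r
# Sample strings in the two formats seen in the raw files, plus one bad value
raw_dates <- c("4/12/2016", "4/13/2016 12:00:00 AM", "not a date")

# as.Date() returns NA for strings that do not match the format
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Flag entries that failed to parse so they can be inspected before analysis
failed <- raw_dates[is.na(parsed)]
print(parsed)
print(failed)
```

Running the same `is.na()` check on the converted columns is a cheap way to catch malformed dates early.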

Renaming the date columns to a common name before merging makes the join simpler.

sleepDay <- rename(sleepDay, date = SleepDay)
dailyActivity <- rename(dailyActivity, date = ActivityDate)

Time to merge the dailyActivity and sleepDay datasets, using Id and date as the common key.

merged_daily_activity <- merge(x = dailyActivity, y = sleepDay, by = c("Id", "date"), all.x = TRUE )

Output of the merge

head(merged_daily_activity)
##           Id       date TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12      13162          8.50            8.50
## 2 1503960366 2016-04-13      10735          6.97            6.97
## 3 1503960366 2016-04-14      10460          6.74            6.74
## 4 1503960366 2016-04-15       9762          6.28            6.28
## 5 1503960366 2016-04-16      12669          8.16            8.16
## 6 1503960366 2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
##   TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1                 1                327            346
## 2                 2                384            407
## 3                NA                 NA             NA
## 4                 1                412            442
## 5                 2                340            367
## 6                 1                700            712
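
A left join keeps every row of the left table and fills the unmatched columns with NA, as in row 3 of the output above. It can also add rows when the right table repeats a key. The sketch below shows both effects on toy tables; the tables `activity` and `sleep` and the repeated sleep row are made up for illustration, and whether the real sleep data contains such repeats is an assumption to verify.

```r
# Toy tables: one activity row per day, but the sleep table repeats 4/13
activity <- data.frame(
  Id = c(1, 1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14")),
  TotalSteps = c(13162, 10735, 10460)
)
sleep <- data.frame(
  Id = c(1, 1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384, 384)
)

# Left join: keep every activity row, attach sleep data where it exists
joined <- merge(x = activity, y = sleep, by = c("Id", "date"), all.x = TRUE)

nrow(activity)                         # 3 rows before the join
nrow(joined)                           # 4 rows after: the repeated sleep row matched twice
sum(is.na(joined$TotalMinutesAsleep))  # 1 day (4/14) has no sleep record
```

Comparing row counts before and after a join is a simple way to notice repeated keys.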

To look at user behaviour by day of the week, I add a Weekday column derived from the date column

merged_daily_activity <- transform(merged_daily_activity, Weekday = weekdays(date))
glimpse(merged_daily_activity)
## Rows: 943
## Columns: 19
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ date                     <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
## $ TotalSleepRecords        <dbl> 1, 2, NA, 1, 2, 1, NA, 1, 1, 1, NA, 1, 1, 1, …
## $ TotalMinutesAsleep       <dbl> 327, 384, NA, 412, 340, 700, NA, 304, 360, 32…
## $ TotalTimeInBed           <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377, 36…
## $ Weekday                  <chr> "Tuesday", "Wednesday", "Thursday", "Friday",…
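
`weekdays()` returns day names as plain text in the session's language, and text sorts alphabetically in tables and on plot axes. A minimal sketch of one alternative, assuming English labels are wanted: derive the ISO weekday number and build an ordered factor. The names `day_names`, `day_index`, and `weekday_factor` are introduced here for illustration.

```r
# Locale-independent weekday labels, ordered Monday to Sunday
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# "%u" gives the ISO weekday number as text: 1 = Monday ... 7 = Sunday
day_index <- as.integer(format(dates, "%u"))

# An ordered factor makes tables and plot axes follow calendar order
weekday_factor <- factor(day_names[day_index], levels = day_names, ordered = TRUE)
print(weekday_factor)
```

With the factor in place, a plot with the weekday on the x axis runs from Monday to Sunday instead of starting with Friday.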

Let’s check the data for duplicates

sum(duplicated(merged_daily_activity))
## [1] 3

The code reports 3 duplicated rows. Let’s drop them. Note that the same step also calls drop_na(), which removes every row containing a missing value, so days without a matching sleep record are dropped as well.

merged_daily_activity <- merged_daily_activity %>% distinct() %>% drop_na()

Verifying again that no duplicates remain

sum(duplicated(merged_daily_activity))
## [1] 0
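
Since the cleaning step chains duplicate removal and NA removal, it can be useful to count how many rows each part removes. The sketch below does this on a made-up table using base R stand-ins (`unique()` in place of `distinct()`, `complete.cases()` in place of `drop_na()`) so that it runs without extra packages; the table `toy` is hypothetical.

```r
# Toy merged table: one duplicated row and one row with missing sleep data
toy <- data.frame(
  Id = c(1, 1, 1, 1),
  TotalSteps = c(13162, 10735, 10735, 10460),
  TotalMinutesAsleep = c(327, 384, 384, NA)
)

# unique() plays the role of distinct(): it removes exact duplicate rows
deduplicated <- unique(toy)

# complete.cases() plays the role of drop_na(): it keeps rows with no NA at all
complete_only <- deduplicated[complete.cases(deduplicated), ]

# Report how many rows each step removed
rows_removed <- c(
  duplicates = nrow(toy) - nrow(deduplicated),
  missing = nrow(deduplicated) - nrow(complete_only)
)
print(rows_removed)
```

Reporting the two counts separately shows how much of the data loss comes from missing values rather than from duplicates.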

The data is now clean. Moving on to the Analyze and Share phase

ANALYZE AND SHARE

Let’s summarize the data

merged_daily_activity %>% 
  select(
    TotalSteps,
    TotalDistance,
    VeryActiveMinutes,
    FairlyActiveMinutes,
    LightlyActiveMinutes,
    SedentaryMinutes,
    TotalMinutesAsleep,
    TotalTimeInBed,
    Calories
  ) %>% 
  summary()
##    TotalSteps    TotalDistance    VeryActiveMinutes FairlyActiveMinutes
##  Min.   :   17   Min.   : 0.010   Min.   :  0.00    Min.   :  0.00     
##  1st Qu.: 5189   1st Qu.: 3.592   1st Qu.:  0.00    1st Qu.:  0.00     
##  Median : 8913   Median : 6.270   Median :  9.00    Median : 11.00     
##  Mean   : 8515   Mean   : 6.012   Mean   : 25.05    Mean   : 17.92     
##  3rd Qu.:11370   3rd Qu.: 8.005   3rd Qu.: 38.00    3rd Qu.: 26.75     
##  Max.   :22770   Max.   :17.540   Max.   :210.00    Max.   :143.00     
##  LightlyActiveMinutes SedentaryMinutes TotalMinutesAsleep TotalTimeInBed 
##  Min.   :  2.0        Min.   :   0.0   Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:158.0        1st Qu.: 631.2   1st Qu.:361.0      1st Qu.:403.8  
##  Median :208.0        Median : 717.0   Median :432.5      Median :463.0  
##  Mean   :216.5        Mean   : 712.1   Mean   :419.2      Mean   :458.5  
##  3rd Qu.:263.0        3rd Qu.: 782.8   3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :518.0        Max.   :1265.0   Max.   :796.0      Max.   :961.0  
##     Calories   
##  Min.   : 257  
##  1st Qu.:1841  
##  Median :2207  
##  Mean   :2389  
##  3rd Qu.:2920  
##  Max.   :4900

The averages reported in the output are:
- TotalSteps (average): 8515.
- TotalDistance (average): 6.012.
- VeryActiveMinutes (average): 25.05 minutes.
- FairlyActiveMinutes (average): 17.92 minutes.
- LightlyActiveMinutes (average): 216.5 minutes.
- SedentaryMinutes (average): 712.1 minutes.
- TotalMinutesAsleep (average): 419.2 minutes.
- TotalTimeInBed (average): 458.5 minutes.
- Calories (average): 2389.
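
The minute-based averages are easier to interpret in hours. A short worked example using the figures from the summary (the names `avg_minutes` and `sleep_efficiency` are introduced here; the last figure is a ratio of averages, not an average of per-night ratios):

```r
# Averages in minutes, copied from the summary above
avg_minutes <- c(asleep = 419.2, in_bed = 458.5, sedentary = 712.1)

# Convert to hours for easier reading
avg_hours <- round(avg_minutes / 60, 2)
print(avg_hours)

# Share of time in bed actually spent asleep
sleep_efficiency <- avg_minutes[["asleep"]] / avg_minutes[["in_bed"]]
print(round(sleep_efficiency, 3))
```

This works out to roughly 7 hours asleep, 7.6 hours in bed, and just under 12 sedentary hours per day, with about 91% of time in bed spent asleep.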

I planned a few more investigations into user behavior.

Let’s analyze the data using the relationships that became available after merging the tables:

Calories Vs Very Active Minutes

Calories Vs Total Distance covered and Total Steps

Sedentary Minutes Vs Total Minutes Sleep and Total Time In Bed

install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggplot2)

Total Distance Vs Calories

The plot suggests that the greater the distance a person travels, the more calories they expend.

ggplot(data = merged_daily_activity) +
  geom_point(mapping = aes(x = Calories, y = TotalDistance, color = TotalSteps))

Calories Vs Active Minutes

The plot suggests that the more very active minutes a person logs, the more calories they burn.

ggplot(data = merged_daily_activity) +
  geom_point(mapping = aes(x = Calories, y = VeryActiveMinutes))

Sedentary Minutes Vs Total Steps Taken and Calories

The plot suggests an inverse relationship between the number of sedentary minutes and the total number of steps taken. It shows no clear connection between sedentary minutes and the number of calories burned. In general, the number of calories burned increases with the total number of steps taken.

ggplot(data = merged_daily_activity) +
  geom_point(mapping = aes(x = TotalSteps, y = SedentaryMinutes, color = Calories))

Sedentary Minutes Vs Calories.

The plot suggests that even when we are inactive our bodies continue to expend some calories, though not nearly as many as when we are physically active.

ggplot(data = merged_daily_activity) +
  geom_point(mapping = aes(x = Calories, y = SedentaryMinutes))

Sleep Pattern

My observations suggest that people sleep more during the week, and that they may exercise more because they are well rested.

ggplot(data = merged_daily_activity) +
  geom_point(mapping = aes(x = Weekday, y = TotalMinutesAsleep))
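
A scatter plot of individual nights can be hard to read, so comparing group averages is one way to check a weekday pattern. The sketch below shows only the mechanics, using six nights copied from the sleepDay output earlier in this write-up. They come from a single participant and are far too few to confirm or refute the observation above; the names `sleep_sample` and `day_type` are introduced here for illustration.

```r
# Six nights copied from the sleepDay output earlier in this write-up
sleep_sample <- data.frame(
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-15",
                   "2016-04-16", "2016-04-17", "2016-04-19")),
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304)
)

# ISO weekday number: 6 and 7 are Saturday and Sunday
day_number <- as.integer(format(sleep_sample$date, "%u"))
sleep_sample$day_type <- ifelse(day_number >= 6, "weekend", "weekday")

# Mean minutes asleep for each group
mean_sleep <- tapply(sleep_sample$TotalMinutesAsleep, sleep_sample$day_type, mean)
print(mean_sleep)
```

Applying the same grouping to the full merged data would give averages that can be compared directly with the plot.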

Interpreting Statistical findings:

install.packages("ggpubr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggpubr)

By comparing these three scenarios, I concluded that “Total Distance” is the variable most closely associated with “Calories”.

p1 <- ggplot(data = merged_daily_activity) +
  geom_point(mapping = aes(x = TotalDistance, y = Calories), color = "purple")
p2 <- ggplot(data = merged_daily_activity) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories), color = "green")
p3 <- ggplot(data = merged_daily_activity) +
  geom_point(mapping = aes(x = VeryActiveMinutes, y = Calories), color = "orange")
library(cowplot)
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggpubr':
## 
##     get_legend
## The following object is masked from 'package:lubridate':
## 
##     stamp
plot_grid(p1, p2, p3, labels = c("Total Distance Vs. Calories", "Total Steps Vs. Calories", "Very Active Minutes Vs. Calories"), ncol = 3, nrow = 1)
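
Visual comparison of scatter plots can be backed up with a number. A minimal sketch, assuming Pearson correlation is an acceptable measure of association: it uses base R's `cor()` on the first six daily records copied from the merged output earlier in this write-up. Those six rows belong to one participant and only illustrate the mechanics; the real comparison would run the same code on merged_daily_activity. The names `sample_rows`, `predictors`, and `correlations` are introduced here.

```r
# First six daily records, copied from the merged output earlier in this write-up
sample_rows <- data.frame(
  TotalDistance = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48),
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705),
  VeryActiveMinutes = c(25, 21, 30, 29, 36, 38),
  Calories = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# Pearson correlation of each candidate predictor with Calories
predictors <- c("TotalDistance", "TotalSteps", "VeryActiveMinutes")
correlations <- sapply(predictors, function(col) {
  cor(sample_rows[[col]], sample_rows$Calories)
})

# Sort from strongest to weakest absolute association
print(sort(abs(correlations), decreasing = TRUE))
```

A correlation close to 1 or -1 indicates a strong linear association, while a value near 0 indicates a weak one.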

ACT

Insights gained during this Analysis:

The goal of this phase is to communicate the insights gained from this analysis to Bellabeat’s stakeholders clearly, so that they can drive future data analysis projects in support of marketing strategy and future growth. The most important takeaway is that the Fitbit users appear to have recorded their step data more than twice as often as their sleep data. If this is indeed the case, then new data sources may present further opportunities for Bellabeat to educate its customers about the relative importance of getting enough sleep and staying active.

An overview of the results is as follows:

The more active a person is, as measured by the total amount of time spent being active, the more calories that person burns. The number of steps taken and the distance traveled are both directly related to the number of calories expended. The graphs also suggest that people burned more calories when they had sufficient rest. This is consistent with our theory that being more active not only helps us stay in good health but also benefits us at work and in our personal lives, where we can be more productive and sleep better.

Recommendation for Bellabeat Marketing Team:

Bellabeat should advertise the benefits of its products alongside the advantages of walking, running, and other forms of exercise, emphasizing that Bellabeat products help users monitor and manage a healthy lifestyle by providing insights and data that support continued improvement and an active life. The application can be kept simple to use and can offer guidance to the consumer based on data patterns it records over a one-month period. It is therefore essential to help individuals progress gradually from a sedentary lifestyle to a casually active and then a fairly active lifestyle, and to offer consumers incentives for achieving their goals.

This case study provided me with a wealth of knowledge and insights for carrying out the analysis process from beginning to conclusion.