In this capstone project, we examine the real-world scenario of Bellabeat, a high-tech manufacturer of health-focused products for women. As a successful small company, Bellabeat has the potential to make a significant impact in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, recognizes the value of analyzing smart device fitness data as a means to unlock new growth opportunities. As a junior data analyst on the marketing analyst team, I have been assigned the task of focusing on one of Bellabeat’s products and analyzing smart device data to gain insights into consumer usage patterns. The objective is to provide actionable recommendations that will guide Bellabeat’s marketing strategy.
The primary objective of this capstone project is to analyze smart device data to gain valuable insights into how consumers are utilizing Bellabeat’s products. By examining user behavior, engagement patterns, and other relevant metrics, we aim to uncover significant trends and behaviors that can inform Bellabeat’s marketing strategy. The insights derived from this analysis will serve as a foundation for developing targeted marketing campaigns, improving customer experiences, and driving innovation within the company.
Analyze non-Bellabeat smart device data alongside a specific Bellabeat product to extract insights, identify growth opportunities, and provide recommendations for improving Bellabeat’s marketing strategy by leveraging trends in smart device usage.
The FitBit Fitness Tracker Data is available on Kaggle and was made accessible through Mobius. The dataset consists of 18 CSV files containing smart health data from the personal fitness trackers of thirty Fitbit users, including minute-level output for physical activity, heart rate, and sleep monitoring. The data was generated through a distributed survey conducted via Amazon Mechanical Turk between March 12, 2016, and May 12, 2016, in which the eligible users consented to submit their personal tracker data. The files analyzed here come from the folder covering April 12 to May 12, 2016 (Fitabase Data 4.12.16-5.12.16).
The dataset provides information about daily activity, steps, heart rate, and sleep. As of May 2023, it had last been updated two years earlier.
The variation observed in the data stems from the use of different types of Fitbit trackers and individual tracking behaviors and preferences. With this rich dataset, it is possible to explore and analyze the impact of various factors on fitness and health-related metrics captured by FitBit devices.
The dataset’s metadata confirms that it is open data. The owner has dedicated the work to the public domain by waiving all of their rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. Anyone can copy, modify, distribute, and perform the work, even for commercial purposes, without asking permission.
The dataset consists of 18 CSV files that contain diverse quantitative data tracked by Fitbit devices. The data is structured in long format, where each row represents a specific time point for a particular subject. Consequently, each subject, identified by a unique ID, has multiple rows, because data is recorded by day and, in some tables, by hour or minute.
Given the relatively small sample size, I employed sorting and filtering techniques in Google Sheets to organize the data. By creating Pivot Tables, I could examine the attributes and observations in each table, as well as establish relationships between them. Additionally, I performed a count of the sample size (number of users) in each table and verified that the analysis spanned a period of 31 days.
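The same two checks (number of users and length of the covered period) can also be scripted. The following is a minimal sketch in base R on a made-up table; the `toy_daily` data frame and its values are illustrative assumptions, not rows from the real files.

```r
# Toy stand-in for one of the daily tables (values are made up for illustration).
toy_daily <- data.frame(
  Id = c("u01", "u01", "u01", "u02", "u02"),
  ActivityDate = c("4/12/2016", "4/13/2016", "5/12/2016", "4/12/2016", "4/14/2016"),
  stringsAsFactors = FALSE
)

# Number of distinct users in the table.
n_users <- length(unique(toy_daily$Id))

# Parse the month/day/year strings and measure the covered period,
# counting both the first and the last day.
dates <- as.Date(toy_daily$ActivityDate, format = "%m/%d/%Y")
n_days <- as.integer(diff(range(dates))) + 1L

n_users
n_days
```

With dates running from April 12 to May 12, 2016, the period comes out as 31 days, which is the span reported for the spreadsheet check.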
Considering the small size of the dataset (30 users) and the absence of demographic information, sampling bias is likely, and we cannot assume that the sample represents the broader population. In addition, the data is not current (it was collected in 2016), and the survey covered only two months.
To address these limitations, we will adopt an operational approach for our case study. By focusing on actionable insights and practical implications, we aim to derive meaningful conclusions and recommendations despite the inherent constraints. This approach allows us to leverage the available data effectively and provide valuable insights within the given context.
I have made the decision to utilize R as the primary tool for my analysis due to its accessibility, data processing capabilities, and visualization features. R is a widely adopted open-source programming language that offers a multitude of packages and functions specifically designed for data analysis and statistical tasks. By leveraging R’s extensive ecosystem, I can take advantage of its robust functionality to efficiently handle and manipulate the dataset.
Although the number of users is small, some of the tables are sizable: the hourly steps file alone contains 22,099 rows. R lets me clean, transform, and summarize tables of this size in a scripted, repeatable way, which makes it easier to uncover patterns and trends.
Furthermore, R’s powerful visualization libraries, including ggplot2 and plotly, empower me to create visually appealing and informative data visualizations. These visualizations serve as powerful tools for communicating the analysis results to stakeholders in a clear and concise manner. By presenting the findings through engaging and intuitive visual representations, I can enhance understanding, facilitate decision-making, and effectively convey the key insights derived from the dataset.
By leveraging the accessibility, data processing capabilities, and visualization features of R, I am confident that I can conduct a comprehensive analysis that not only uncovers valuable insights but also effectively communicates the findings to stakeholders.
In our R programming workflow, we will carefully select and load the essential packages to enhance our analysis capabilities. The following R packages have been curated specifically for our analysis:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ggpubr)
library(here)
## here() starts at /cloud/project
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggrepel)
library(hms)
##
## Attaching package: 'hms'
##
## The following object is masked from 'package:lubridate':
##
## hms
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(formatR)
library(glue)
With the available datasets in mind, we now load the ones that will help us answer our business task. Our analysis focuses on the following datasets.
Because of the small samples, this analysis does not consider the weight data (8 users) or the heart rate data (7 users).
daily_activity <- read_csv(file = "/cloud/project/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_sleep <- read_csv(file = "/cloud/project/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("/cloud/project/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
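The messages above suggest specifying column types. One reason to do so is that `Id` is an identifier rather than a quantity, yet it is guessed as a double and later printed as `1.5e+09`. The sketch below shows the idea in base R with `read.csv()` and `colClasses` on a two-row inline stand-in for the file; with readr, the comparable option would be `col_types = cols(Id = col_character())`. Treating `Id` as text is a suggestion here, not what the analysis in this report does.

```r
# A two-row stand-in for dailyActivity_merged.csv (only three of its columns).
csv_text <- "Id,ActivityDate,TotalSteps
1503960366,4/12/2016,13162
1503960366,4/13/2016,10735"

# Declare the types up front: Id is a label, the date stays text until it is
# parsed explicitly, and the step count is a whole number.
toy_activity <- read.csv(
  text = csv_text,
  colClasses = c(Id = "character", ActivityDate = "character", TotalSteps = "integer")
)

str(toy_activity)
```

Keeping identifiers as character strings avoids scientific notation in printed output and rules out accidental arithmetic on IDs.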
To gain an initial understanding of our selected data frames and their contents, we will preview them and examine the summary statistics for each column. This process will provide us with insights into the structure, variables, and distribution of the data. By doing so, we can gather valuable information that will aid us in subsequent analysis and decision-making.
head(daily_activity)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
str(daily_activity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(daily_sleep)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:0… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
str(daily_sleep)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(hourly_steps)
## # A tibble: 6 × 3
## Id ActivityHour StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
str(hourly_steps)
## spc_tbl_ [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ StepTotal : num [1:22099] 373 160 151 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityHour = col_character(),
## .. StepTotal = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
daily_activity %>%
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes, Calories) %>%
  summary()
## TotalSteps TotalDistance SedentaryMinutes Calories
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.:1828
## Median : 7406 Median : 5.245 Median :1057.5 Median :2134
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean :2304
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :4900
daily_sleep %>%
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
hourly_steps %>%
  select(StepTotal) %>%
  summary()
## StepTotal
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 40.0
## Mean : 320.2
## 3rd Qu.: 357.0
## Max. :10554.0
Based on the available data, several key observations can be made:
Average sedentary time: The average sedentary time is 991 minutes, or approximately 16.5 hours, per day. This is a large share of the day spent in a sedentary state, which suggests that efforts to reduce sedentary behavior could benefit overall health.
Activity level: The majority of participants in the study exhibit light activity levels. This suggests that they engage in low-intensity physical activities throughout the day. While some activity is present, it may be beneficial for individuals to incorporate more moderate or vigorous activities into their routines to achieve optimal health outcomes.
Sleep duration: Participants typically log one sleep record per day and sleep about 7 hours on average (mean 419.5 minutes, median 433 minutes). This is in line with the commonly recommended sleep duration for adults. The summary statistics alone cannot show whether individual participants experience sleep disorders or disruptions.
Average daily steps: The average total steps per day is calculated to be 7,638. While some level of physical activity is observed, this falls short of the commonly recommended goal of 10,000 steps per day for health benefits. Notably, research by the CDC suggests that taking 8,000 steps per day is associated with a 51% lower risk of all-cause mortality, while taking 12,000 steps per day is associated with a 65% lower risk compared to taking 4,000 steps. Therefore, there is room for improvement in increasing daily step count to reap more significant health benefits.
From these initial findings, the average user in this dataset demonstrates a baseline level of activity and sleeps for an adequate duration, but may benefit from incorporating more physical activity into their daily routine. These observations provide insights into user behavior and can inform future strategies for promoting healthier lifestyles.
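The unit conversions behind these observations can be checked directly. The sketch below copies the means from the `summary()` output above and converts them in base R; the variable names are illustrative, and the 10,000-step figure is the commonly cited goal mentioned in the text rather than something derived from the data.

```r
# Means copied from the summary() output of the daily tables.
mean_sedentary_min <- 991.2
mean_asleep_min    <- 419.5
mean_in_bed_min    <- 458.6
mean_steps         <- 7638

sedentary_hours <- mean_sedentary_min / 60         # hours per day
sedentary_share <- mean_sedentary_min / (24 * 60)  # fraction of a 24-hour day
asleep_hours    <- mean_asleep_min / 60            # hours asleep per night
asleep_in_bed   <- mean_asleep_min / mean_in_bed_min  # share of time in bed spent asleep
steps_vs_goal   <- mean_steps / 10000              # share of a 10,000-step goal

round(c(sedentary_hours = sedentary_hours,
        sedentary_share = sedentary_share,
        asleep_hours    = asleep_hours,
        asleep_in_bed   = asleep_in_bed,
        steps_vs_goal   = steps_vs_goal), 2)
```

The rounded results are 16.52 sedentary hours (about 69% of the day), 6.99 hours asleep, 0.91 of the time in bed spent asleep, and 0.76 of the 10,000-step goal.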
Having familiarized ourselves with the data structures, our next step is to process them in order to identify and rectify any errors or inconsistencies. By carefully examining the data, we can detect missing values, outliers, and other data quality issues that may impact the integrity and accuracy of our analysis. Through data processing techniques such as data cleaning, validation, and transformation, we aim to ensure that the data is reliable and suitable for further analysis. This meticulous approach will enhance the reliability and validity of our findings, enabling us to draw meaningful insights and make informed decisions based on the processed data.
Before cleaning the data, we determine the number of unique users in each data frame. The sample is small (nominally 30 users), and the counts below show that the tables do not all cover the same users: the activity and hourly steps tables contain 33 IDs, while the sleep table contains only 24. We nevertheless retain the sleep dataset for the purpose of practicing data cleaning techniques.
n_unique(daily_activity$Id)
## [1] 33
n_unique(daily_sleep$Id)
## [1] 24
n_unique(hourly_steps$Id)
## [1] 33
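Because the user counts differ between tables, it can help to see which users are missing from the smaller one. A minimal base R sketch, using made-up ID labels rather than the real ones, could look like this:

```r
# Hypothetical ID vectors standing in for unique(daily_activity$Id)
# and unique(daily_sleep$Id).
activity_ids <- c("u01", "u02", "u03", "u04")
sleep_ids    <- c("u01", "u03")

# Users who logged activity but never logged sleep.
missing_sleep <- setdiff(activity_ids, sleep_ids)

# Share of activity users who also appear in the sleep table.
sleep_coverage <- length(intersect(activity_ids, sleep_ids)) / length(activity_ids)

missing_sleep
sleep_coverage
```

Any later merge on user ID can only keep the users in the intersection, so this coverage figure bounds how many users a combined activity and sleep table can describe.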
To ensure data integrity, our next step is to identify and remove duplicate rows. We count the fully duplicated rows in each data frame. Only daily_sleep (413 rows) contains any, so in practice the removal changes that dataset alone. Eliminating duplicates prevents the same night from being counted more than once in later summaries.
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
sum(duplicated(hourly_steps))
## [1] 0
daily_activity <- daily_activity %>%
  distinct() %>%
  drop_na()
daily_sleep <- daily_sleep %>%
  distinct() %>%
  drop_na()
hourly_steps <- hourly_steps %>%
  distinct() %>%
  drop_na()
We then verify that the duplicates in daily_sleep have been removed. A count of zero confirms that the dataset no longer contains redundant rows and is ready for further analysis.
sum(duplicated(daily_sleep))
## [1] 0
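To make the effect of the two cleaning steps concrete, here is a small base R sketch on a made-up sleep table (the rows are invented for illustration). `duplicated()` flags rows that repeat an earlier row in every column, and `complete.cases()` plays the role that `drop_na()` plays in the pipeline above.

```r
# Toy sleep table: row 2 repeats row 1 exactly, and row 4 has a missing value.
toy_sleep <- data.frame(
  Id = c("u01", "u01", "u01", "u02"),
  SleepDay = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 384, NA),
  stringsAsFactors = FALSE
)

# Count rows that are exact copies of an earlier row.
n_dupes <- sum(duplicated(toy_sleep))

# Keep the first copy of each row and drop rows with any missing value.
cleaned <- toy_sleep[!duplicated(toy_sleep) & complete.cases(toy_sleep), ]

# Guard: stop if the cleaning left duplicates or missing values behind.
stopifnot(sum(duplicated(cleaned)) == 0, !anyNA(cleaned))

n_dupes
nrow(cleaned)
```

Note that only exact duplicates are caught. Two rows for the same user and day with different values would survive and would need a separate check on the `Id` and date columns.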
To keep the datasets consistent and compatible for merging, we standardize the column names. The clean_names() calls below only preview a snake_case version of the names, because their results are not assigned. The names actually used in the rest of the analysis come from rename_with(..., tolower), which converts every column name to lowercase without inserting underscores (for example, TotalSteps becomes totalsteps). A consistent naming convention avoids mismatched keys when the datasets are merged.
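The difference between plain lowercasing and snake_case naming is easy to see on a few of the column names. The sketch below uses base R only; the regular expression is a simplified stand-in for what `clean_names()` does to these particular names, not a reimplementation of the janitor function.

```r
original <- c("Id", "ActivityDate", "TotalSteps", "SedentaryMinutes")

# Plain lowercasing, as done by rename_with(..., tolower).
lower_only <- tolower(original)

# Insert an underscore at each lowercase-to-uppercase boundary, then lowercase.
# This mimics the snake_case style for simple CamelCase names only.
snake_case <- tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", original))

lower_only
snake_case
```

Either convention works for merging as long as every table uses the same one. Mixing them would leave one table with `totalsteps` and another with `total_steps`.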
clean_names(daily_activity)
## # A tibble: 940 × 15
## id activity_date total_steps total_distance tracker_distance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## 7 1503960366 4/18/2016 13019 8.59 8.59
## 8 1503960366 4/19/2016 15506 9.88 9.88
## 9 1503960366 4/20/2016 10544 6.68 6.68
## 10 1503960366 4/21/2016 9819 6.34 6.34
## # ℹ 930 more rows
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## # very_active_distance <dbl>, moderately_active_distance <dbl>,
## # light_active_distance <dbl>, sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
daily_activity <- rename_with(daily_activity, tolower)
clean_names(daily_sleep)
## # A tibble: 410 × 5
## id sleep_day total_sleep_records total_minutes_asleep total_time_in_bed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/201… 1 327 346
## 2 1.50e9 4/13/201… 2 384 407
## 3 1.50e9 4/15/201… 1 412 442
## 4 1.50e9 4/16/201… 2 340 367
## 5 1.50e9 4/17/201… 1 700 712
## 6 1.50e9 4/19/201… 1 304 320
## 7 1.50e9 4/20/201… 1 360 377
## 8 1.50e9 4/21/201… 1 325 364
## 9 1.50e9 4/23/201… 1 361 384
## 10 1.50e9 4/24/201… 1 430 449
## # ℹ 400 more rows
daily_sleep <- rename_with(daily_sleep, tolower)
clean_names(hourly_steps)
## # A tibble: 22,099 × 3
## id activity_hour step_total
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
## 7 1503960366 4/12/2016 6:00:00 AM 0
## 8 1503960366 4/12/2016 7:00:00 AM 0
## 9 1503960366 4/12/2016 8:00:00 AM 250
## 10 1503960366 4/12/2016 9:00:00 AM 1864
## # ℹ 22,089 more rows
hourly_steps <- rename_with(hourly_steps, tolower)
With the column names converted to lowercase, we turn to the date-time format in the daily_activity and daily_sleep data frames, which we intend to merge. Because the time component in daily_sleep (12:00:00 AM in the rows previewed above) can be disregarded for our analysis, we use the as_date function instead of as_datetime to convert the values in both data frames to dates only. This gives both datasets the same date representation, so the date column can serve as a merge key.
daily_activity <- daily_activity %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
daily_sleep <- daily_sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))
head(daily_activity)
## # A tibble: 6 × 15
## id date totalsteps totaldistance trackerdistance
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## # ℹ 10 more variables: loggedactivitiesdistance <dbl>,
## # veryactivedistance <dbl>, moderatelyactivedistance <dbl>,
## # lightactivedistance <dbl>, sedentaryactivedistance <dbl>,
## # veryactiveminutes <dbl>, fairlyactiveminutes <dbl>,
## # lightlyactiveminutes <dbl>, sedentaryminutes <dbl>, calories <dbl>
head(daily_sleep)
## # A tibble: 6 × 5
## id date totalsleeprecords totalminutesasleep totaltimeinbed
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
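The two format strings can be tried on a couple of sample values before they are applied to whole columns. This sketch uses base R's `as.Date()`, which accepts the same format codes; the sample strings follow the patterns seen in the previews above. One assumption to keep in mind is that parsing the AM/PM marker (`%p`) depends on the session locale.

```r
activity_raw <- c("4/12/2016", "4/13/2016")
sleep_raw    <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# %m/%d/%Y: month, day, four-digit year.
activity_date <- as.Date(activity_raw, format = "%m/%d/%Y")

# %I is the hour on a 12-hour clock and %p the AM/PM marker;
# converting to a date drops the time of day.
sleep_date <- as.Date(sleep_raw, format = "%m/%d/%Y %I:%M:%S %p")

# A format that does not match the text yields NA instead of an error,
# so it is worth checking for missing values after parsing.
bad_parse <- as.Date("4/12/2016", format = "%Y-%m-%d")

stopifnot(!anyNA(activity_date), !anyNA(sleep_date))
all(activity_date == sleep_date)
is.na(bad_parse)
```

Once both columns hold the same dates, they can be compared or used as a join key. A silent `NA` from a wrong format string is the failure mode to watch for.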
Next, we convert the activityhour strings in the hourly_steps dataset to a date-time format. We rename the column to date_time and parse it as POSIXct, so that dates and times are represented consistently and can be compared and aggregated across the dataset.
hourly_steps <- hourly_steps %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()))
head(hourly_steps)
## # A tibble: 6 × 3
## id date_time steptotal
## <dbl> <dttm> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373
## 2 1503960366 2016-04-12 01:00:00 160
## 3 1503960366 2016-04-12 02:00:00 151
## 4 1503960366 2016-04-12 03:00:00 0
## 5 1503960366 2016-04-12 04:00:00 0
## 6 1503960366 2016-04-12 05:00:00 0
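Parsing with `tz = Sys.timezone()` ties the result to the machine on which the report is knit, and the columns shown here carry no time-zone information of their own. As a hedged alternative, the sketch below parses a few sample strings with a fixed zone and then splits the result into a date and an hour of day. The choice of `"UTC"` is an assumption made for reproducibility, not something the dataset specifies, and AM/PM parsing again depends on the locale.

```r
raw_hours <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/2016 1:00:00 PM")

# Parse with an explicit, fixed time zone so every machine gets the same result.
date_time <- as.POSIXct(raw_hours, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Split into calendar date and hour of day (0 to 23) for hourly summaries.
day         <- as.Date(date_time, tz = "UTC")
hour_of_day <- as.integer(format(date_time, "%H"))

data.frame(date_time, day, hour_of_day)
```

Here 12:00:00 AM maps to hour 0 and 1:00:00 PM to hour 13, which is the form needed to average steps by hour of day.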
To explore potential correlations between variables, we will merge the daily_activity and daily_sleep datasets. This merging process will be based on the common identifiers (id) and the date values (date) as the primary keys. By combining the relevant information from both datasets, we can analyze the relationship between different variables and gain insights into how they may be correlated.
daily_activity_sleep <- merge(daily_activity, daily_sleep, by = c("id", "date"))
glimpse(daily_activity_sleep)
## Rows: 410
## Columns: 18
## $ id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ date <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-…
## $ totalsteps <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544…
## $ totaldistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ trackerdistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ loggedactivitiesdistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactivedistance <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3…
## $ moderatelyactivedistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3…
## $ lightactivedistance <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6…
## $ sedentaryactivedistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactiveminutes <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3…
## $ fairlyactiveminutes <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,…
## $ lightlyactiveminutes <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, …
## $ sedentaryminutes <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, …
## $ calories <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177…
## $ totalsleeprecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ totalminutesasleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ totaltimeinbed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, …
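By default, `merge()` performs an inner join, which is why the merged table has 410 rows rather than the 940 rows of daily_activity: activity days without a matching sleep record are dropped. The sketch below shows the difference between the inner join and a left join (`all.x = TRUE`) on made-up data, followed by a simple correlation. The toy values are assumptions for illustration only.

```r
# Toy tables: u01 has three activity days but only two sleep records.
toy_activity <- data.frame(
  id = c("u01", "u01", "u01", "u02"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14", "2016-04-12")),
  totalsteps = c(13162, 10735, 10460, 5000)
)
toy_sleep <- data.frame(
  id = c("u01", "u01", "u02"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalminutesasleep = c(327, 384, 420)
)

# Inner join (the default): only user-days present in both tables.
inner <- merge(toy_activity, toy_sleep, by = c("id", "date"))

# Left join: every activity day is kept, with NA where no sleep was logged.
left <- merge(toy_activity, toy_sleep, by = c("id", "date"), all.x = TRUE)

nrow(inner)
nrow(left)
sum(is.na(left$totalminutesasleep))

# Pearson correlation between steps and minutes asleep on the matched days.
steps_sleep_cor <- cor(inner$totalsteps, inner$totalminutesasleep)
```

The inner join suits correlation analysis, since both variables must be present. The left join is the safer choice when activity summaries should still cover days without sleep data.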
Bellabeat, a tech-driven wellness company for women, has leveraged data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits. With its rapid growth since its establishment in 2013, Bellabeat aims to shape its marketing strategy based on insights derived from analyzing FitBit Fitness Tracker Data.
The target audience for Bellabeat includes women who work full-time jobs and spend a significant amount of time engaged in sedentary activities such as computer work or meetings. Although these women engage in light activity to maintain their health, they need to improve their everyday activity levels to reap the full benefits of a healthy lifestyle. Providing knowledge about developing healthy habits and motivation to sustain their efforts could greatly benefit this audience.
The app should provide personalized daily step targets based on the user’s profile, lifestyle, and goals. Sending reminders to users who are falling behind their targets can help motivate them to stay active. Additionally, incorporating features like mini-games or wellness trivia can create a sense of reward and increase user engagement and retention.
The app can send alerts to users who remain seated or inactive for an extended period of time. This feature would be particularly useful for users who work from home and may forget to take breaks. Encouraging regular activity breaks can help improve overall health and combat sedentary behavior.
For users looking to improve their sleep, the app can recommend light activities before bedtime and alert users if their activity level is too intense based on their profile. Additionally, incorporating features to assist with meditation or relaxation techniques can help users wind down and prepare for better sleep quality.
Further investigation is needed to understand why the average wear time of the tracker decreases over time. Analyzing user feedback and conducting user surveys can provide insights into potential reasons for the decline. Additionally, features such as a waterproof design, minimalist aesthetics, long battery life, and a comfortable fit can help encourage users to wear the tracker consistently throughout the day.
By implementing these ideas, Bellabeat can further empower women to lead healthier lifestyles and achieve a balance between their personal and professional lives with the support of their app.
Social Networking and Team Goal Setting
Incorporating social networking features such as in-app chats or team goal setting can enhance user engagement and promote exercise habits. Studies have shown that social support interventions increase physical activity among adults. By fostering a sense of community and accountability, users can motivate and inspire each other to stay active.