Bellabeat is a high-tech manufacturer of health-focused products for women. They are looking for growth opportunities in the global smart device market.
The task is to analyze smart device usage data to gain insight into users’ daily habits related to physical activity, heart rate, and sleep patterns. These insights can reveal trends in how customers use smart devices, which can then inform Bellabeat’s marketing strategy and product improvements. The specific questions to answer are as follows:
- What are some trends in smart device usage?
- How could these trends apply to Bellabeat customers?
- How could these trends help influence Bellabeat marketing strategy?
I will use R to explore, clean, analyze, and visualize the data for this case study, because R is a comprehensive tool for statistical computing, data analysis, and graphical visualization.
I want to base the analysis on daily data, so I have uploaded the four CSV files that contain day-level tables, along with the weight log.
Install and load the tidyverse, along with lubridate, ggplot2, and corrplot.
install.packages('tidyverse')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
install.packages('corrplot')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(corrplot)
## corrplot 0.95 loaded
Name the dataframes and take a look at the data. We need to find out how many unique participants there are in each dataframe.
daily_activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
n_distinct(daily_activity$Id)
## [1] 33
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
daily_calories <- read.csv("dailyCalories_merged.csv")
n_distinct(daily_calories$Id)
## [1] 33
head(daily_calories)
## Id ActivityDay Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
daily_intensities <- read.csv("dailyIntensities_merged.csv")
n_distinct(daily_intensities$Id)
## [1] 33
str(daily_intensities)
## 'data.frame': 940 obs. of 10 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
sleep_day <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
n_distinct(sleep_day$Id)
## [1] 24
colnames(sleep_day)
weight_log <- read.csv("weightLogInfo_merged.csv")
n_distinct(weight_log$Id)
## [1] 8
head(weight_log)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
After a quick review, I noted the following:
- All tables have an Id column, so they can be merged.
- The information in daily_intensities and daily_calories is already present in daily_activity, so I dropped these two dataframes.
- Only 8 participants logged weight information, so the weight data is not representative. I dropped this table as well.
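One way to check that a smaller table adds nothing beyond daily_activity is an anti-join on the shared columns. The sketch below is only illustrative: `activity_toy` and `calories_toy` are hypothetical stand-ins with made-up values, and the column names are assumed to match the real files as shown in the output above.

```r
library(dplyr)

# Hypothetical stand-ins for daily_activity and daily_calories.
# The values are invented for illustration, not taken from the Fitbit files.
activity_toy <- data.frame(
  Id           = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps   = c(9000, 7500, 8000),
  Calories     = c(1985, 1797, 2100)
)
calories_toy <- data.frame(
  Id          = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories    = c(1985, 1797, 2100)
)

# Rows of calories_toy that have no identical (Id, date, Calories) row
# in activity_toy. Zero rows means the smaller table is redundant.
unmatched <- anti_join(
  calories_toy, activity_toy,
  by = c("Id", "ActivityDay" = "ActivityDate", "Calories")
)
nrow(unmatched)  # 0 for this toy data
```

Running the same anti-join on the real daily_calories and daily_activity (and similarly for daily_intensities) would confirm the redundancy before the tables are dropped.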
Now we have 2 remaining tables to explore. Let’s compute summary statistics for each dataframe.
daily_activity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :1440.0
sleep_day %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
Let’s parse the dates so that the two tables can be joined on Id and date.
activity <- daily_activity %>%
mutate(ActivityDate = mdy(ActivityDate))
sleep <- sleep_day %>%
mutate(SleepDay = mdy_hms(SleepDay)) %>%
mutate(SleepDay = as_date(SleepDay))
merged_data <- inner_join(activity, sleep,
by = c("Id" = "Id", "ActivityDate" = "SleepDay"))
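Because inner_join() keeps only the days that appear in both tables, it can be useful to see how many activity days have no sleep record. The sketch below uses hypothetical toy rows; the assumed raw formats are month/day/year for ActivityDate and month/day/year plus a 12-hour time for SleepDay, consistent with the mdy() and mdy_hms() calls above.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample rows; the string formats are assumed, not copied
# from the raw files.
activity_toy <- data.frame(
  Id           = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016")
) %>%
  mutate(ActivityDate = mdy(ActivityDate))

sleep_toy <- data.frame(
  Id                 = c(1, 2),
  SleepDay           = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 410)
) %>%
  mutate(SleepDay = as_date(mdy_hms(SleepDay)))

# Activity days without a matching sleep record.
# These are the rows an inner join would silently drop.
no_sleep <- anti_join(activity_toy, sleep_toy,
                      by = c("Id", "ActivityDate" = "SleepDay"))
nrow(no_sleep)  # 1 for this toy data (Id 1 on 2016-04-13)
```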
Now we have a merged dataframe. Next, we keep only the columns of interest and remove duplicate rows and rows with missing values.
merged_data <- merged_data %>%
select(Id, ActivityDate, TotalSteps, SedentaryMinutes, Calories,
TotalMinutesAsleep, TotalTimeInBed)
merged_data <- merged_data %>%
distinct() %>%
drop_na()
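Before removing anything, it can help to count how many rows the cleaning step will affect. The following is a minimal sketch on a hypothetical toy dataframe (`toy` and its values are made up); the same two counts could be taken on merged_data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: row 2 repeats row 1, and row 4 has a missing value.
toy <- data.frame(
  Id                 = c(1, 1, 1, 2),
  TotalSteps         = c(5000, 5000, 7000, NA),
  TotalMinutesAsleep = c(400, 400, 380, 420)
)

n_duplicates <- sum(duplicated(toy))       # exact repeats of an earlier row
n_incomplete <- sum(!complete.cases(toy))  # rows with at least one NA

# Same cleaning steps as above: drop repeated rows, then rows with NA.
cleaned <- toy %>%
  distinct() %>%
  drop_na()

c(duplicates = n_duplicates, incomplete = n_incomplete, kept = nrow(cleaned))
```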
Increase Moderate Physical Activity: Taking more daily steps may help improve sleep duration and quality.
Reduce Sedentary Time: Long periods of inactivity are weakly associated with reduced sleep.
Monitor Consistency: Establishing consistent activity and sleep patterns could be more beneficial than focusing on daily totals alone.
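Associations like the ones above can be examined with a correlation matrix. The sketch below is an assumption about how such a check might look, not the analysis that produced these points: `toy` holds invented values that carry no evidence, and on the real data it would be replaced by merged_data. The corrplot call is guarded so the rest of the code runs without that package.

```r
library(dplyr)

# Hypothetical values standing in for merged_data; they are invented
# and say nothing about the real relationships.
toy <- data.frame(
  TotalSteps         = c(3000, 8000, 12000, 5000, 10000, 7000),
  SedentaryMinutes   = c(900, 700, 600, 850, 650, 720),
  TotalMinutesAsleep = c(380, 420, 450, 400, 430, 410)
)

# Pairwise Pearson correlations between activity and sleep variables.
cor_matrix <- toy %>%
  select(TotalSteps, SedentaryMinutes, TotalMinutesAsleep) %>%
  cor(use = "complete.obs")

round(cor_matrix, 2)

# Optional plot of the matrix, drawn only if corrplot is installed.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "number")
}
```

Coefficients near zero would indicate a weak linear relationship, and a correlation on its own does not show that one habit causes the other.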
Based on the insights gained from the analysis phase, I think the findings are best applied to improving the Bellabeat app, which provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to the company’s line of smart wellness products, such as Leaf and Time, so the company can promote these two products as well. Some ideas for the app are as follows:
Thank you for your time and attention!