Scenario: You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.
– Ask
– Prepare
– Process
– Analyze
– Share
– Act
These are the questions our stakeholders want me to answer:
On the prepare stage of the analysis, I am going to be using the data provided by the company. Part of the preparation is to determine the credibility of the data provided and choosing which data sets I would like to use for the analysis.
Determining the credibility:
To determine the credibility of the data, I will be using the ROCCC concept. These are 5 criteria a data should meet to assess whether the data is good or bad. ROCCC stands for reliability, original, comprehensive, current and cited.
Reliability: The data is from 33 Fitbit users who consented to the submission of personal data tracker generated from the Amazon Mechanical Turk. It is a small and dated data set which means that the analysis is outdated may be biased.
Original: The data was collected from 33 Fitbit devices with the consent of the users. The users understood how the data was collected, stored, managed and protected.
Comprehensive: The data contains a lot of information, and it needs to be cleaned by removing the nulls, duplicates and 0 values.
Current: The data set was collected from 04/12/2016 – 05/12/2016 which means that the data is not current.
Cited: Full information in regards to the source of the data https://www.kaggle.com/datasets/arashnic/fitbit.
Upon examining the data, I decided that I will be focusing on 3 data sets out of the 18 provided. This case study will be focused on the weight log info, daily activity merged, sleep day merged.
1.) Upon examining the data I chose to work with, I decided to pre-format and pre-clean the data on Excel before working on it in RStudio. I decided to format all the dates into short dates because I will not be needing the hours that are in the date column.
2.) I started by installing and loading the necessary packages I need to process the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
## here() starts at /cloud/project
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
library(lubridate)
library(ggplot2)
library(readr)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
3.) I imported the data sets I pre-cleaned and pre-formatted.
daily_activity <- read_csv("/cloud/project/daily_activity_merged_new_date.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_activity <- read_csv("/cloud/project/sleep_day_merged_new_date.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_info <- read_csv("/cloud/project/weight_log_info_merged_new_date.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
4.) I will now start cleaning the data in RStudio.
Using the colnames(), I checked if the name formats are uniform.
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(sleep_activity)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weight_info)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
Upon examining the data, I noticed that the name of the columns are not uniform. I used the clean_names() to replace spaces, special characters, and uppercase letters with underscores and lowercase letters, which makes the column names easier to work with in R.
sleep_activity <- clean_names(sleep_activity)
daily_activity <- clean_names(daily_activity)
weight_info <- clean_names(weight_info)
Upon further examination, I noticed that the date columns of the different data sets have different names. To make the merging of the data sets easier and cleaner, I will use the rename() to have a uniform name title for the date column for all three data sets. I decided to name the date column as “activity_date”.
sleep_activity <- sleep_activity %>% rename(activity_date = sleep_day)
daily_activity <- daily_activity %>% rename(activity_date = activity_date)
weight_info <- weight_info %>% rename(activity_date = date)
I double checked if the all the column names are uniform by using the colnames() again.
colnames(daily_activity)
## [1] "id" "activity_date"
## [3] "total_steps" "total_distance"
## [5] "tracker_distance" "logged_activities_distance"
## [7] "very_active_distance" "moderately_active_distance"
## [9] "light_active_distance" "sedentary_active_distance"
## [11] "very_active_minutes" "fairly_active_minutes"
## [13] "lightly_active_minutes" "sedentary_minutes"
## [15] "calories"
colnames(sleep_activity)
## [1] "id" "activity_date" "total_sleep_records"
## [4] "total_minutes_asleep" "total_time_in_bed"
colnames(weight_info)
## [1] "id" "activity_date" "weight_kg" "weight_pounds"
## [5] "fat" "bmi" "is_manual_report" "log_id"
I double-checked the output summary of my data set using the str().
str(daily_activity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_date : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ total_steps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ total_distance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ tracker_distance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ logged_activities_distance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ very_active_distance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ moderately_active_distance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ light_active_distance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ sedentary_active_distance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ very_active_minutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ fairly_active_minutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ lightly_active_minutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentary_minutes : num [1:940] 728 776 1218 726 773 ...
## $ calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(sleep_activity)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_date : chr [1:413] "4/12/2016" "4/13/2016" "4/15/2016" "4/16/2016" ...
## $ total_sleep_records : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ total_minutes_asleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ total_time_in_bed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(weight_info)
## spc_tbl_ [67 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ activity_date : chr [1:67] "5/2/2016" "5/3/2016" "4/13/2016" "4/21/2016" ...
## $ weight_kg : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
## $ weight_pounds : num [1:67] 116 116 294 125 126 ...
## $ fat : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
## $ bmi : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
## $ is_manual_report: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
## $ log_id : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. Date = col_character(),
## .. WeightKg = col_double(),
## .. WeightPounds = col_double(),
## .. Fat = col_double(),
## .. BMI = col_double(),
## .. IsManualReport = col_logical(),
## .. LogId = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Upon examination of the data, I noticed that the activity_date column of the different data sets are in character format. Therefore I need to reformat them into date formats by using the mdy() that is part of the lubridate package.I also used the head() to double-check if the character format on my activity_date column was transformed into date format.
daily_activity$activity_date <- mdy(daily_activity$activity_date)
head(daily_activity)
## # A tibble: 6 × 15
## id activity_date total_steps total_distance tracker_distance
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## # very_active_distance <dbl>, moderately_active_distance <dbl>,
## # light_active_distance <dbl>, sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
sleep_activity$activity_date <- mdy(sleep_activity$activity_date)
head(sleep_activity)
## # A tibble: 6 × 5
## id activity_date total_sleep_records total_minutes_asleep
## <dbl> <date> <dbl> <dbl>
## 1 1503960366 2016-04-12 1 327
## 2 1503960366 2016-04-13 2 384
## 3 1503960366 2016-04-15 1 412
## 4 1503960366 2016-04-16 2 340
## 5 1503960366 2016-04-17 1 700
## 6 1503960366 2016-04-19 1 304
## # ℹ 1 more variable: total_time_in_bed <dbl>
weight_info$activity_date <- mdy(weight_info$activity_date)
head(weight_info)
## # A tibble: 6 × 8
## id activity_date weight_kg weight_pounds fat bmi is_manual_report
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 1503960366 2016-05-02 52.6 116. 22 22.6 TRUE
## 2 1503960366 2016-05-03 52.6 116. NA 22.6 TRUE
## 3 1927972279 2016-04-13 134. 294. NA 47.5 FALSE
## 4 2873212765 2016-04-21 56.7 125. NA 21.5 TRUE
## 5 2873212765 2016-05-12 57.3 126. NA 21.7 TRUE
## 6 4319703577 2016-04-17 72.4 160. 25 27.5 TRUE
## # ℹ 1 more variable: log_id <dbl>
I identified and verified each number of data frame by id using the n_unique() using the id column of each data sets.
n_unique(daily_activity$id)
## [1] 33
n_unique(sleep_activity$id)
## [1] 24
n_unique(weight_info$id)
## [1] 8
I tried to identify and verify if there are any duplicates in the data sets being used by using the sum() and duplicated()
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(sleep_activity))
## [1] 3
sum(duplicated(weight_info))
## [1] 0
The function for the sleep_activity data frame returned that there are 3 duplicates. To verify if they are actual duplicates, meaning all the columns and rows had the same entry, I will be using the get_dupes () which is a part of the janitor package.
get_dupes(sleep_activity)
## No variable names specified - using all columns.
## # A tibble: 6 × 6
## id activity_date total_sleep_records total_minutes_asleep
## <dbl> <date> <dbl> <dbl>
## 1 4388161847 2016-05-05 1 471
## 2 4388161847 2016-05-05 1 471
## 3 4702921684 2016-05-07 1 520
## 4 4702921684 2016-05-07 1 520
## 5 8378563200 2016-04-25 1 388
## 6 8378563200 2016-04-25 1 388
## # ℹ 2 more variables: total_time_in_bed <dbl>, dupe_count <int>
After verifying that the data is an actual duplicate, I will be using the distinct() to remove all duplicates and double-checking them by using the sum() and duplicated function.
sleep_activity <- distinct(sleep_activity, .keep_all = TRUE)
sum(duplicated(sleep_activity))
## [1] 0
After cleaning, processing, and analyzing the data that was provided, I concluded that not a lot of people understand the importance of being physically active. According to the trends that were shown, the users spent a lot of time in sedentary state. This might be because of their job or school, but people need to be reminded that they need to move their body to be healthy. These trends apply to Bellabeat’s customers because we can improve our products by setting up reminders to stand up, take a walk, and maybe even stretch. An hourly reminder from the app to remind the customers to have healthier habits can help us improve sales. Promoting a healthy lifestyle is not only a good marketing strategy, but it is also a great way to advertise self-care.
1.) According to the trends, the users were inconsistent with tracking their weight. Depending on the user’s goals, we can set up a reminder for the user to input their weight every week if their goal is to lose or gain weight. That way they can track their improvements.
2.) According to the data provided, a lot of people do not get enough sleep. We can set up reminders for the users to start getting ready for bed. Having the option to play relaxing music from the app and even reminding them to sleep.
3.) The company can also invest in improving their app to expand its abilities such as having the option to personalize the user’s goals. Examples would be giving them the option to record what they eat, counting their calories depending on the workout they did like yoga, outdoor walk, kayaking, playing tennis etc. This encourages the users to wear their devices everywhere all the time because it will count all the calories they are burning regardless of the activity.
4.) The company could also invest in adding additional features to the devices such as being able to track the location and or the ability to call 911 if there is an emergency.
5.) The company can also invest in improving the battery life of the devices. This will encourage the users to wear it more because they would not have to worry about the battery dying or re-charging it often.
6.) The company should also invest in the quality and comfort of the products since a lot of users wear the devices on majority of the day. The users can focus more on their health and lifestyle and worry less about the devices being scratched or breaking during activities.
If all the recommendations were acted on, the marketing team can advertise all these improvements on our devices which could help us boost the sales of the company. We could also use user-generated content creators and bloggers to help us with marketing. According to research a lot of consumers establish greater trust between customer and brand with user-generated content creators. It is real and honest content that the potential customer can trust, as it is not the brand’s own content, but is based on the experience that other users have had with the products.
Danielle Pacheco - March 2, 2023
https://www.sleepfoundation.org/how-sleep-works/why-do-we-need-sleep
Division of Nutrition, Physical Activity and Obesity, National Center for Chronic Disease Prevention and Health Promotion – June 16, 2022
Subir Biswas - January 4, 2022