Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company
1.What are some trends in smart device usage? 2.How could these trends apply to Bellabeat customers? 3.How could these trends help influence Bellabeat marketing strategy
Identify potential opportunities for growth and recommendations for the Bellabeat marketing strategy improvement based on trends in smart device usage.
In this case study analysis i will use R programming language for the data manipulation, exploration and visualization.
install.packages('tidyverse')
Error in install.packages : Updating loaded packages
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
Take a look at the activity data.
```{r}
Error: attempt to use zero-length variable name
Identify all the columns in the activity data.
colnames(activity)
[1] "Id" "ActivityDate" "TotalSteps" "TotalDistance"
[5] "TrackerDistance" "LoggedActivitiesDistance" "VeryActiveDistance" "ModeratelyActiveDistance"
[9] "LightActiveDistance" "SedentaryActiveDistance" "VeryActiveMinutes" "FairlyActiveMinutes"
[13] "LightlyActiveMinutes" "SedentaryMinutes" "Calories"
Take a look at the sleep data.
Identify all the columns in the sleep data.
colnames(sleep)
[1] "extra" "group" "ID"
#Activity
activity$ActivityDate=as.POSIXct(activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
activity$date <- format(activity$ActivityDate, format = "%m/%d/%y")
# intensities
intensities$ActivityHour=as.POSIXct(intensities$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
intensities$time <- format(intensities$ActivityHour, format = "%H:%M:%S")
intensities$date <- format(intensities$ActivityHour, format = "%m/%d/%y")
# sleep
sleep$SleepDay=as.POSIXct(sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleep$date <- format(sleep$SleepDay, format = "%m/%d/%y")
View the calories dataframe
calories=as.POSIXct(calories, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
Error in as.POSIXct.default(calories, format = "%m/%d/%Y %I:%M:%S %p", :
do not know how to convert 'calories' to class “POSIXct”
Note that the datasets have the ‘Id’ field - this can be used to merge the datasets.
How many unique participants are there in each dataframe?
n_distinct(activity$Id)
[1] 33
n_distinct(sleep$Id)
[1] 24
n_distinct(intensities$Id)
[1] 33
n_distinct(calories$Id)
[1] 14
n_distinct(weight$Id)
[1] 8
There is 33 participants in the activity and Intensities, 14 in calories data set and 24 in the sleep dataset and only 8 in the weight data set.
Let’s have a look at summary statistics of the data sets:
# observations
nrow(activity)
nrow(sleep)
nrow(intensities)
nrow(calories)
nrow(weight)
What are some quick summary statistics we’d want to know about each data frame?
For the activity dataframe:
#activity
activity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes) %>%
summary()
TotalSteps TotalDistance SedentaryMinutes
Min. : 0 Min. : 0.000 Min. : 0.0
1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
Median : 7406 Median : 5.245 Median :1057.5
Mean : 7638 Mean : 5.490 Mean : 991.2
3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
Max. :36019 Max. :28.030 Max. :1440.0
# explore number of active minutes per category
activity %>%
select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes) %>%
summary()
VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
Min. : 0.00 Min. : 0.00 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
Median : 4.00 Median : 6.00 Median :199.0
Mean : 21.16 Mean : 13.56 Mean :192.8
3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
Max. :210.00 Max. :143.00 Max. :518.0
Average sedentary time is 991.2 minutes or 16.51 hours. Definately needs to be reduced!
The majority of the participants are lightly active .
Average total steps per day are 7638 which is a little bit less for healthy benefits.
According to the CDC research. They discovered that taking 8,000 steps per day was associated with a 51% lower risk for all-cause of mortality (or death from all causes). Taking 12,000 steps per day was associated with a 65% lower risk compared with taking 4,000 steps.
For the sleep dataframe:
sleep %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
Min. :1.000 Min. : 58.0 Min. : 61.0
1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
Median :1.000 Median :433.0 Median :463.0
Mean :1.119 Mean :419.5 Mean :458.6
3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
Max. :3.000 Max. :796.0 Max. :961.0
What does this tell us about how this sample of people’s activities?
On an average participants sleeps over 6 hours and stays 7 hours on bed. Participant spend more time on bed than sleeping on an average.
Note that there were more participant Ids in the activity dataset that have been filtered out using merge.
Lets Take a look at how many participants are in this combined data set.
n_distinct(combined_data$Id)
[1] 24
There are 24 participants in the new combined dataset.
What’s the relationship between steps taken in a day and sedentary minutes?
How could this help inform the customer segments that we can market to? E.g. position this more as a way to get started in walking more? Or to measure steps that you’re already taking?
far less participants are taking more steps compared to sedimentary participants.The marketing team should position their campaign to encourage users to switch-on notifications to remind users to take more steps to achieve their health goals.
The relationship is completely linear - although there any unexpected trends.There are less participants spending more time on bed which could which suggest that this category of users have poor sleep.
What could these trends tell you about how to help market this product? Or areas where you might want to explore further?
Bellabeat users can improve their sleep by using notifications to remind them its time to go to sleep
###Let’s look at intensities data over time (hourly).
hourly_intensity <- intensities %>%
group_by(time) %>%
drop_na() %>%
summarise(mean_total_int = mean(TotalIntensity))
hourly_intensity %>%
ggplot(aes(x=time, y=mean_total_int)) +
geom_col(stat =
"identity", fill='steelblue') +
theme(axis.text.x = element_text(angle = 90)) +
labs(title="Average Total Intensity vs. Time")
Warning: Ignoring unknown parameters: `stat`
Meanwhile, Most activity happens between 5 pm and 7 pm - I suppose, that people go to a gym or for a walk after finishing work.
We can use this time in the Bellabeat app to remind and motivate users to go for a run or walk.
combined_data %>%
ggplot( aes(x=TotalMinutesAsleep, y=SedentaryMinutes)) +
geom_point(color='steelblue') + geom_smooth(color="red") +
labs(title="Sleep Time by Sedentary Time in Minutes",
x= "Sedimentary Time",
y="Total sleeping time")
The higher the sedimentary time the lower the sleep time. Here we can clearly see the negative relationship between Sedentary Minutes and Sleep time. Although this correlation may not equate causation.
Bellabeat app can recommend a reduction in sedimentary time in order to improve users sleep time.
Now you can explore some different relationships between activity and sleep as well. For example, do you think participants who sleep more also take more steps:
or fewer steps per day? Is there a relationship at all? How could these answers help inform the marketing strategy of how you position this new product?
combined_data %>%
ggplot( aes(x=TotalMinutesAsleep, y=VeryActiveDistance)) +
geom_point(color='steelblue') +
labs(title="Minutes Asleep by Active Distance",
x="Minutes Asleep",
y="Active Distance")
There are many participants sleeping than taking those taking steps on an average. The marketing team needs to encourage users to review their daily activities for recommendation where they could be lacking behind.Only few participants are taking more steps and achieving longer distance..
colnames(combined_data)
[1] "Id" "date" "SleepDay"
[4] "TotalSleepRecords" "TotalMinutesAsleep" "TotalTimeInBed"
[7] "ActivityDate" "TotalSteps" "TotalDistance"
[10] "TrackerDistance" "LoggedActivitiesDistance" "VeryActiveDistance"
[13] "ModeratelyActiveDistance" "LightActiveDistance" "SedentaryActiveDistance"
[16] "VeryActiveMinutes" "FairlyActiveMinutes" "LightlyActiveMinutes"
[19] "SedentaryMinutes" "Calories"
combined_data %>%
ggplot( aes(x=SleepDay)) +
geom_bar(bandwith = 5, alpha = 3, fill="darkblue")+
labs(title="Participants Asleep by Months",
x= "Sleep Day Months",
y="Count")
Warning: Ignoring unknown parameters: `bandwith`
The highest sleeping days were in the months of April.
Target audience are women
Women who work full-time jobs (according to the hourly intensity data) spend a lot of time on the computer or in a meeting and are focused on their jobs. (according to the sedentary time data visualization).
Less activity is recorded for most users with more sedentary time. Therefore since Bellabeat is a friendly app for women, a campaign that empowers women to balance full personal and professional life with healthy habits or routines through education and motivation through daily app recommendations is key.
Most activity happens between 5 pm and 7 pm - I suppose, that people go to a gym or for a walk after finishing work. Bellabeat can use this time to remind and motivate users to go for a run or walk.
Thank you for your interest reading and studying my analytical processes on Bellabeat Case Study!
This is my second project using R. I would appreciate any comments and recommendations for improvement!