Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Bellabeat wants to identify trends in how consumers use smart devices and the business growth opportunities those trends reveal, and to develop high-level recommendations for its marketing strategy.
The case study is built around three business questions that define its scope:
- What are the trends identified?
- How could these trends apply to Bellabeat consumers?
- How could these trends help influence Bellabeat marketing strategy?
Bellabeat's key stakeholders, who are interested in the solution to this business task, are:
- Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer.
- Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.
- Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.
In this section we go through data exploration: where the data is stored and how it was verified with the ROCCC method, including checks on licensing, privacy, security, accessibility, and integrity. We also highlight how the data helps us answer the business questions.
The data for this case study is publicly available on Kaggle. The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.
High-quality data helps us make reliable decisions, so we first need to check the quality of our data. The ROCCC method (reliable, original, comprehensive, current, cited) shows us how good the data is.
Bellabeat has set standards for how this data is collected, shared, and used, and has maintained its privacy and validity. The Fitbit dataset meets the six elements of data ethics: ownership, transaction transparency, consent, currency, privacy, and openness.
The Fitbit data is not clean. To process it, we will go through the dataset files, check the variables and observations, sort and filter the data, remove missing values, rename columns, and prepare a clean dataset.
This Bellabeat case study uses R to clean and analyze the data. R is a programming language originally created for statistical analysis and is widely used for data analysis.
setwd("~/Desktop/RPROJECTS/Fitness")
R uses several packages to speed up the data analysis process. In this capstone, we will use the following packages.
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("skimr")
library("here")
## here() starts at /Users/mohamedabdilahi/Desktop/RPROJECTS/Fitness
library("lubridate")
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library("janitor")
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("dplyr")
library("scales")
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library("ggpubr")
The Fitbit dataset contains a number of CSV files. We will analyze only the three most important: daily activity, daily sleep, and hourly steps. To explore the data, we first import these datasets into our environment.
activity <- read_csv("~/Desktop/RPROJECTS/Fitness/fitabase_data/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep <- read_csv("~/Desktop/RPROJECTS/Fitness/fitabase_data/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
steps <- read_csv("~/Desktop/RPROJECTS/Fitness/fitabase_data/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let us explore the data and check its completeness. We have loaded three datasets: daily activity, daily sleep, and hourly steps. Before analyzing the data, we need to clean it: remove duplicates and missing values, and reformat any columns that need it.
At this stage, we look at the datasets as a whole and identify any missing values, duplicates, and incorrect data types.
We have determined that this data needs to be cleaned and tidied. First, we look at the number of users in each dataset. It should be approximately 30, but it may be slightly higher or lower.
n_unique(activity$Id)
## [1] 33
n_unique(sleep$Id)
## [1] 24
n_unique(steps$Id)
## [1] 33
sum(duplicated(activity))
## [1] 0
sum(duplicated(sleep))
## [1] 3
sum(duplicated(steps))
## [1] 0
We found 3 duplicate observations in the sleep data. Let us remove the duplicates, along with any rows that contain missing values:
sleep <- sleep %>%
distinct() %>%
drop_na()
Now let us check that the duplicates were removed.
sum(duplicated(sleep))
## [1] 0
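The duplicate check above can be illustrated on a small scale. The following is a minimal sketch in base R using a made-up data frame (`toy_sleep` and its values are assumptions for illustration, not rows from the Fitbit files); `unique()` on a data frame plays the role that `distinct()` plays in the pipeline.

```r
# Toy data frame (made-up values) containing one exact duplicate row
toy_sleep <- data.frame(
  id = c(1, 1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12", "2016-04-13")),
  minutes_asleep = c(327, 327, 412, 340)
)

# duplicated() flags the second and later copies of a row, not the first
n_duplicates <- sum(duplicated(toy_sleep))

# unique() keeps the first copy of each distinct row
deduplicated <- unique(toy_sleep)

n_duplicates        # 1
nrow(deduplicated)  # 3
```

Because `duplicated()` compares whole rows, two records for the same user and date that differ in any column would not be counted as duplicates.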
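Column names that mix upper- and lowercase letters can cause confusion during analysis, so it is good practice to convert all column names to lowercase. Below, clean_names() is called without assigning its result, so it only previews snake_case names; the change that is actually kept comes from rename_with(tolower), which lowercases the names without adding underscores (for example, totalsteps rather than total_steps).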
# Daily Activity datasets
clean_names(activity)
## # A tibble: 940 × 15
## id activity_date total_steps total_distance tracker_distance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## 7 1503960366 4/18/2016 13019 8.59 8.59
## 8 1503960366 4/19/2016 15506 9.88 9.88
## 9 1503960366 4/20/2016 10544 6.68 6.68
## 10 1503960366 4/21/2016 9819 6.34 6.34
## # … with 930 more rows, and 10 more variables:
## # logged_activities_distance <dbl>, very_active_distance <dbl>,
## # moderately_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>
activity<- rename_with(activity, tolower)
# Daily Sleep datasets
clean_names(sleep)
## # A tibble: 410 × 5
## id sleep_day total_sleep_rec… total_minutes_a… total_time_in_b…
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:0… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
## 7 1503960366 4/20/2016 12:0… 1 360 377
## 8 1503960366 4/21/2016 12:0… 1 325 364
## 9 1503960366 4/23/2016 12:0… 1 361 384
## 10 1503960366 4/24/2016 12:0… 1 430 449
## # … with 400 more rows
sleep <- rename_with(sleep, tolower)
# Hourly Steps datasets
clean_names(steps)
## # A tibble: 22,099 × 3
## id activity_hour step_total
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
## 7 1503960366 4/12/2016 6:00:00 AM 0
## 8 1503960366 4/12/2016 7:00:00 AM 0
## 9 1503960366 4/12/2016 8:00:00 AM 250
## 10 1503960366 4/12/2016 9:00:00 AM 1864
## # … with 22,089 more rows
steps <- rename_with(steps, tolower)
Dates and times are very important in this analysis because we are going to analyze daily activity records. If the date and time columns are not converted properly, the results will not be correct.
As we have seen, the activitydate and sleepday columns in the activity and sleep data are stored as character data. Let us convert them to date format, and convert activityhour in the steps data to a date-time.
activity <- activity %>%
rename(date = activitydate) %>%
mutate(date = as_date(date, format = "%m/%d/%Y"))
sleep <- sleep %>%
rename(date = sleepday) %>%
mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))
steps <- steps %>%
rename(date_time = activityhour) %>%
mutate(date_time = as.POSIXct(date_time, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))
str(activity)
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date[1:940], format: "2016-04-12" "2016-04-13" ...
## $ totalsteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ totaldistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ trackerdistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ loggedactivitiesdistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ veryactivedistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ moderatelyactivedistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ lightactivedistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ sedentaryactivedistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ veryactiveminutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ fairlyactiveminutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ lightlyactiveminutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentaryminutes : num [1:940] 728 776 1218 726 773 ...
## $ calories : num [1:940] 1985 1797 1776 1745 1863 ...
str(sleep)
## tibble [410 × 5] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:410] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date[1:410], format: "2016-04-12" "2016-04-13" ...
## $ totalsleeprecords : num [1:410] 1 2 1 2 1 1 1 1 1 1 ...
## $ totalminutesasleep: num [1:410] 327 384 412 340 700 304 360 325 361 430 ...
## $ totaltimeinbed : num [1:410] 346 407 442 367 712 320 377 364 384 449 ...
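To make the format strings used in the conversions above easier to follow, here is a minimal base R sketch that parses one sample string of each shape. The variable names and the choice of `tz = "UTC"` are assumptions made for this illustration; the analysis itself uses lubridate's `as_date()` and the system time zone.

```r
# Sample strings in the shapes that appear in the CSV files
activity_date <- "4/12/2016"
sleep_day     <- "4/12/2016 12:00:00 AM"
activity_hour <- "4/12/2016 1:00:00 PM"

# A date-only column needs only month/day/year
parsed_activity <- as.Date(activity_date, format = "%m/%d/%Y")

# as.Date() ignores trailing characters once the format is consumed,
# so the midnight time stamp in the sleep column can simply be dropped
parsed_sleep <- as.Date(sleep_day, format = "%m/%d/%Y")

# Hourly data keeps the time: %I is the 12-hour clock and %p is AM/PM
# (%p depends on the locale; an English or C locale is assumed here).
# tz = "UTC" is an assumption so the result does not depend on the machine.
parsed_hour <- as.POSIXct(activity_hour,
                          format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

format(parsed_activity)       # "2016-04-12"
format(parsed_hour, "%H:%M")  # "13:00"
```

Fixing the time zone explicitly is a design choice: with `Sys.timezone()`, the same file can yield different hourly timestamps on machines in different regions.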
We have reached the last stage of data processing. We now combine the two datasets to examine the relationship between daily activity and daily sleep.
activity_sleep <- merge(activity, sleep, by=c("id", "date"))
glimpse(activity_sleep)
## Rows: 410
## Columns: 18
## $ id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ date <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-…
## $ totalsteps <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544…
## $ totaldistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ trackerdistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ loggedactivitiesdistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactivedistance <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3…
## $ moderatelyactivedistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3…
## $ lightactivedistance <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6…
## $ sedentaryactivedistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactiveminutes <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3…
## $ fairlyactiveminutes <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,…
## $ lightlyactiveminutes <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, …
## $ sedentaryminutes <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, …
## $ calories <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177…
## $ totalsleeprecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ totalminutesasleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ totaltimeinbed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, …
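The merged table has 410 rows while the activity table has 940, because `merge()` performs an inner join by default and keeps only id/date pairs that appear in both tables. The sketch below shows this on made-up data (`toy_activity` and `toy_sleep` are assumptions for illustration) and contrasts it with `all.x = TRUE`, which would keep every activity day.

```r
# Toy tables (made-up values): three activity days, sleep logged on two of them
toy_activity <- data.frame(
  id = c(1, 1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14")),
  totalsteps = c(13162, 10735, 10460)
)
toy_sleep <- data.frame(
  id = c(1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13")),
  totalminutesasleep = c(327, 384)
)

# Default merge() is an inner join: only id/date pairs present in both survive
inner <- merge(toy_activity, toy_sleep, by = c("id", "date"))

# all.x = TRUE keeps every activity row and fills missing sleep values with NA
left <- merge(toy_activity, toy_sleep, by = c("id", "date"), all.x = TRUE)

nrow(inner)                          # 2
nrow(left)                           # 3
sum(is.na(left$totalminutesasleep))  # 1
```

For this analysis the inner join means that any statistic computed from the merged table describes only the days on which a user also recorded sleep.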
We will now extract insights about how Fitbit users use their devices, so that Bellabeat can identify trends in the market.
According to the 10000Steps (2022), activity trackers provide data which enables you to become aware of your physical activity levels, work towards a goal and monitor progress. Studies using the 10,000 steps per day goal have shown weight loss, improved glucose tolerance, and reduced blood pressure from increased physical activity toward achieving this goal. The following pedometer indices have been developed to provide a guideline on steps and activity levels:
Although the program promotes the goal of reaching 10,000 steps each day for healthy adults, this goal is not universally appropriate across all ages and physical function. There are some groups where the goal of 10,000 steps may not be accurate, such as the elderly and children. Your individual step goal should be based on current activity levels and overall health and fitness goals. For people who normally do fewer than 10,000 steps, increasing daily activity by 1-2,000 steps per day will provide health benefits.
ggplot(data = activity, aes(x=totalsteps, y=calories, fill = totalsteps))+
geom_point() + geom_smooth() + labs(title = "Total Steps vs Calories")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plot shows a positive relationship between total steps and calories: the more steps users take, the more calories they tend to burn.
daily_average <- activity_sleep %>%
group_by(id) %>%
summarise (mean_daily_steps = mean(totalsteps), mean_daily_calories = mean(calories), mean_daily_sleep = mean(totalminutesasleep))
head(daily_average)
## # A tibble: 6 × 4
## id mean_daily_steps mean_daily_calories mean_daily_sleep
## <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 12406. 1872. 360.
## 2 1644430081 7968. 2978. 294
## 3 1844505072 3477 1676. 652
## 4 1927972279 1490 2316. 417
## 5 2026352035 5619. 1541. 506.
## 6 2320127002 5079 1804 61
Now, let us classify our users by daily average steps:
user_type <- daily_average %>%
mutate(user_type = case_when(
mean_daily_steps < 5000 ~ "sedentary",
mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active",
mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active",
mean_daily_steps >= 10000 ~ "very active"
))
head(user_type)
## # A tibble: 6 × 5
## id mean_daily_steps mean_daily_calories mean_daily_sleep user_type
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1503960366 12406. 1872. 360. very active
## 2 1644430081 7968. 2978. 294 fairly active
## 3 1844505072 3477 1676. 652 sedentary
## 4 1927972279 1490 2316. 417 sedentary
## 5 2026352035 5619. 1541. 506. lightly acti…
## 6 2320127002 5079 1804 61 lightly acti…
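Because mean daily steps are fractional, the class boundaries must not leave gaps between categories. One way to make the intervals explicit is base R's `cut()`; the sketch below is a hypothetical helper (`classify_steps` is not part of the original analysis) that uses the same 5,000 / 7,500 / 10,000 thresholds and checks values right at the boundaries.

```r
# Hypothetical helper, not part of the original analysis
classify_steps <- function(mean_daily_steps) {
  cut(
    mean_daily_steps,
    breaks = c(-Inf, 5000, 7500, 10000, Inf),
    labels = c("sedentary", "lightly active", "fairly active", "very active"),
    right = FALSE  # intervals are [lower, upper), so 7500 is "fairly active"
  )
}

# Boundary values, including averages that fall between whole step counts
test_steps <- c(4999.9, 5000, 7499.5, 7500, 9999.5, 10000)
classified <- as.character(classify_steps(test_steps))
classified
# "sedentary" "lightly active" "lightly active"
# "fairly active" "fairly active" "very active"
```

With contiguous breaks, every non-missing value lands in exactly one class, so no user can end up with an `NA` user type.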
Now that we have a new column with the user type we will create a data frame with the percentage of each user type to better visualize them on a graph.
user_type_percent <- user_type %>%
group_by(user_type) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(user_type) %>%
summarise(total_percent = total / totals) %>%
mutate(labels = scales::percent(total_percent))
user_type_percent$user_type <- factor(user_type_percent$user_type , levels = c("very active", "fairly active", "lightly active", "sedentary"))
head(user_type_percent)
## # A tibble: 4 × 3
## user_type total_percent labels
## <fct> <dbl> <chr>
## 1 fairly active 0.375 38%
## 2 lightly active 0.208 21%
## 3 sedentary 0.208 21%
## 4 very active 0.208 21%
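The share of each user type is simply its count divided by the total number of users. The following base R sketch reproduces that arithmetic on eight made-up users (the vector `user_types` is an assumption for illustration) and builds rounded percentage labels similar to those above.

```r
# Made-up user types for eight hypothetical users
user_types <- c("sedentary", "very active", "fairly active", "fairly active",
                "lightly active", "fairly active", "sedentary", "very active")

# Count each type, then divide by the total to get shares
counts <- table(user_types)
shares <- setNames(as.vector(counts) / sum(counts), names(counts))

# Whole-number percentage labels for a chart
labels <- setNames(paste0(round(100 * shares), "%"), names(shares))

shares[["fairly active"]]  # 0.375
labels[["fairly active"]]  # "38%"
```

The shares always sum to 1, which is a quick sanity check before plotting them as a pie chart.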
Below we can see that users are fairly evenly distributed across activity levels based on their daily steps. This suggests that users of all activity levels wear smart devices.
user_type_percent %>%
ggplot(aes(x="",y=total_percent, fill=user_type)) +
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0)+
theme_minimal()+
theme(axis.title.x= element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
scale_fill_manual(values = c("#85e085","#e6e600", "#ffd480", "#ff8080")) +
geom_text(aes(label = labels),
position = position_stack(vjust = 0.5))+
labs(title="User type distribution")
We now want to know on which days of the week users are most active and on which days they sleep the most. We will also check whether users walk the recommended number of steps and get the recommended amount of sleep.
Below we derive the weekday from our date column and calculate the average steps walked and minutes slept for each weekday.
weekday_steps_sleep <- activity_sleep %>%
mutate(weekday = weekdays(date))
weekday_steps_sleep$weekday <-ordered(weekday_steps_sleep$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))
weekday_steps_sleep <-weekday_steps_sleep%>%
group_by(weekday) %>%
summarize (daily_steps = mean(totalsteps), daily_sleep = mean(totalminutesasleep))
head(weekday_steps_sleep)
## # A tibble: 6 × 3
## weekday daily_steps daily_sleep
## <ord> <dbl> <dbl>
## 1 Monday 9273. 420.
## 2 Tuesday 9183. 405.
## 3 Wednesday 8023. 435.
## 4 Thursday 8184. 401.
## 5 Friday 7901. 405.
## 6 Saturday 9871. 419.
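One caveat with the weekday step above: `weekdays()` returns day names in the session's locale, so matching them against English levels would produce `NA` in a non-English locale. A locale-independent alternative is sketched below using the ISO weekday number from `format(date, "%u")`; the toy data and variable names are assumptions for illustration.

```r
# Two made-up observations per date
toy <- data.frame(
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-17", "2016-04-17")),
  totalsteps = c(13162, 9000, 4000, 6000)
)

# %u gives the ISO weekday number: 1 = Monday ... 7 = Sunday
toy$weekday_number <- as.integer(format(toy$date, "%u"))

# Fixed English labels, ordered Monday to Sunday regardless of locale
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
toy$weekday <- factor(day_names[toy$weekday_number],
                      levels = day_names, ordered = TRUE)

# Mean steps per weekday
weekday_means <- aggregate(totalsteps ~ weekday, data = toy, FUN = mean)
weekday_means
```

Note also that `head()` prints only the first six rows, so a seven-row weekday summary needs `print()` or a larger `n` to show Sunday.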
ggarrange(
ggplot(weekday_steps_sleep) +
geom_col(aes(weekday, daily_steps), fill = "#006699") +
geom_hline(yintercept = 7500) +
labs(title = "Daily steps per weekday", x= "", y = "") +
theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1)),
ggplot(weekday_steps_sleep, aes(weekday, daily_sleep)) +
geom_col(fill = "#85e0e0") +
geom_hline(yintercept = 480) +
labs(title = "Minutes asleep per weekday", x= "", y = "") +
theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))
)
From the graphs above we can determine the following:
- Users reach the recommended 7,500 daily steps on every day except Sunday.
- Users do not sleep the recommended 8 hours (480 minutes) on any day of the week.
Going deeper into our analysis, we want to know exactly when users are most active during the day.
We will use the hourly steps data frame and separate the date_time column into date and time.
head(steps)
## # A tibble: 6 × 3
## id date_time steptotal
## <dbl> <dttm> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373
## 2 1503960366 2016-04-12 01:00:00 160
## 3 1503960366 2016-04-12 02:00:00 151
## 4 1503960366 2016-04-12 03:00:00 0
## 5 1503960366 2016-04-12 04:00:00 0
## 6 1503960366 2016-04-12 05:00:00 0
steps <- steps %>%
separate(date_time, into = c("date", "time"), sep= " ") %>%
mutate(date = ymd(date))
head(steps)
## # A tibble: 6 × 4
## id date time steptotal
## <dbl> <date> <chr> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373
## 2 1503960366 2016-04-12 01:00:00 160
## 3 1503960366 2016-04-12 02:00:00 151
## 4 1503960366 2016-04-12 03:00:00 0
## 5 1503960366 2016-04-12 04:00:00 0
## 6 1503960366 2016-04-12 05:00:00 0
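Splitting the timestamp on a space relies on how the date-time is converted to text. As an alternative sketch, the hour of day can be extracted directly from the date-time values with `format()`. The timestamps, step counts, and `tz = "UTC"` below are made-up assumptions for illustration.

```r
# Made-up hourly timestamps; tz = "UTC" is an assumption for reproducibility
date_time <- as.POSIXct(
  c("2016-04-12 00:00:00", "2016-04-12 13:00:00", "2016-04-13 13:00:00"),
  tz = "UTC"
)
steptotal <- c(373, 1200, 800)

# Extract the hour of day (0-23) as an integer
hour <- as.integer(format(date_time, "%H"))

# Average steps for each hour across all days
hourly_means <- tapply(steptotal, hour, mean)

unname(hourly_means["13"])  # 1000
unname(hourly_means["0"])   # 373
```

Keeping the hour as an integer also makes the x-axis sort numerically without extra work.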
steps %>%
group_by(time) %>%
summarize(average_steps = mean(steptotal)) %>%
ggplot() +
geom_col(mapping = aes(x=time, y = average_steps, fill = average_steps)) +
labs(title = "Hourly steps throughout the day", x="", y="") +
scale_fill_gradient(low = "green", high = "red")+
theme(axis.text.x = element_text(angle = 90))
Following Macarena Lacasa (2021), we can see that users are more active between 8 am and 7 pm, walking more steps at lunchtime (12 pm to 2 pm) and in the evening (5 pm to 7 pm).
We will now determine if there is any correlation between different variables:
ggarrange(
ggplot(activity_sleep, aes(x=totalsteps, y=totalminutesasleep))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Daily steps vs Minutes asleep", x = "Daily steps", y= "Minutes asleep") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14)),
ggplot(activity_sleep, aes(x=totalsteps, y=calories))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Daily steps vs Calories", x = "Daily steps", y= "Calories") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14))
)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Per our plots:
- There is no clear correlation between daily steps and the number of minutes users sleep per day.
- By contrast, there is a positive correlation between steps and calories burned. As expected, the more steps users walk, the more calories they tend to burn.
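The plots give a visual impression only; a correlation coefficient puts a number on it. The sketch below shows how `cor()` and `cor.test()` could be used for that purpose. The vectors are made-up values chosen for illustration, not results from the Fitbit data, so the numbers they produce say nothing about the real dataset.

```r
# Made-up daily values for illustration only
toy_steps    <- c(2000, 4000, 6000, 8000, 10000, 12000)
toy_calories <- c(1500, 1700, 1800, 2100, 2300, 2600)
toy_sleep    <- c(420, 380, 450, 400, 430, 390)

# Pearson correlation coefficient: ranges from -1 to 1
r_calories <- cor(toy_steps, toy_calories)
r_sleep    <- cor(toy_steps, toy_sleep)

# cor.test() adds a p-value and a confidence interval
test_calories <- cor.test(toy_steps, toy_calories)

round(r_calories, 2)   # close to 1: strong positive relationship
round(r_sleep, 2)      # close to 0: little or no linear relationship
test_calories$p.value
```

On the real data, the same calls could be applied to the `totalsteps`, `calories`, and `totalminutesasleep` columns of `activity_sleep`.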
Now that we have seen some trends in activity, sleep, and calories burned, we want to see how often the users in our sample use their devices. That way we can plan our marketing strategy and see what features would encourage the use of smart devices.
We will calculate the number of users that use their smart device on a daily basis, classifying our sample into three categories knowing that the date interval is 31 days:
- high use: users who use their device between 21 and 31 days.
- moderate use: users who use their device between 11 and 20 days.
- low use: users who use their device between 1 and 10 days.
First, we create a new data frame grouped by id, calculating the number of days used and adding a new column with the classification explained above.
daily_use <- activity_sleep %>%
group_by(id) %>%
summarize(days_used = n()) %>%
mutate(usage = case_when(
days_used >= 1 & days_used <= 10 ~ "low use",
days_used >= 11 & days_used <= 20 ~ "moderate use",
days_used >= 21 & days_used <= 31 ~ "high use",
))
head(daily_use)
## # A tibble: 6 × 3
## id days_used usage
## <dbl> <int> <chr>
## 1 1503960366 25 high use
## 2 1644430081 4 low use
## 3 1844505072 3 low use
## 4 1927972279 5 low use
## 5 2026352035 28 high use
## 6 2320127002 1 low use
We will now create a percentage data frame to better visualize the results in the graph. We are also ordering our usage levels.
daily_use_percent <- daily_use %>%
group_by(usage) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(usage) %>%
summarise(total_percent = total / totals) %>%
mutate(labels = scales::percent(total_percent))
daily_use_percent$usage <- factor(daily_use_percent$usage, levels = c("high use", "moderate use", "low use"))
head(daily_use_percent)
## # A tibble: 3 × 3
## usage total_percent labels
## <fct> <dbl> <chr>
## 1 high use 0.5 50%
## 2 low use 0.375 38%
## 3 moderate use 0.125 12%
Now that we have our percentage table, we can plot the results, with the usage levels shown in order.
daily_use_percent %>%
ggplot(aes(x="",y=total_percent, fill=usage)) +
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0)+
theme_minimal()+
theme(axis.title.x= element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
geom_text(aes(label = labels),
position = position_stack(vjust = 0.5))+
scale_fill_manual(values = c("#006633","#00e673","#80ffbf"),
labels = c("High use - 21 to 31 days",
"Moderate use - 11 to 20 days",
"Low use - 1 to 10 days"))+
labs(title="Daily use of smart device")
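Analyzing our results, we can see that 50% of the users in our sample use their device frequently (21 to 31 days), 12% use it moderately (11 to 20 days), and 38% use it rarely (1 to 10 days).
The findings from the Fitbit dataset point to a relationship between fitness app usage and health. A fitness app acts as a motivator and has a close relationship with its users. These findings show the different activities users engage in and how they can track their health trends.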