2. Ask Phase
2.1 Business Task
3. Prepare Phase
3.1 Dataset Used
3.2 Information about our dataset
3.3 Data Credibility and Integrity
4. Process Phase
4.1 Installing Packages and Opening Libraries
4.2 Importing Datasets
4.3 Preview our Datasets
4.4 Cleaning and Formatting
5. Analyze Phase
5.1 Summary
5.2 Active Minutes
5.3 Noticeable Day
5.4 Interesting Finds
Bellabeat is a high-tech company that makes health-focused smart products. By collecting data on activity, sleep, stress, and reproductive health, Bellabeat has been able to inform women about their own habits and health. Founded in 2013, Bellabeat has grown quickly and established itself as a tech-driven wellness company for women.
Bellabeat App, Leaf, Time, Spring, and Bellabeat Membership are the company’s five main products. Bellabeat is a successful small company with the potential to become a larger player in the global smart device market. Our team has been asked to analyse smart device data to understand how consumers use their smart devices. The insights we find will then inform the company’s marketing strategy.
Analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices and help guide marketing strategy for Bellabeat to grow as a global player. Questions guiding our analysis:
Stakeholders:
The data source used for analysis is the Fitbit Fitness Tracker Data. The dataset was made public by Mobius and is hosted on Kaggle.
This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
It contains 18 CSV files. We assess the data against the ROCCC criteria:
The data has some limitations:
Now we will perform the following tasks:
First we will choose the packages which will help us in our analysis and open them. The packages that will be used are:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(ggpubr)
library(here)
## here() starts at /cloud/project
Now we will import the datasets used in our analysis. The datasets we will use are:
daily_activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_daily <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_Loginfo <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Checking the summary and previewing our selected datasets.
head(daily_activity)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
str(daily_activity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
head(sleep_daily)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:0… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
str(sleep_daily)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(sleep_daily)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(weight_Loginfo)
## # A tibble: 6 × 8
## Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 … 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 … 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016… 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016… 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016… 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016… 72.4 160. 25 27.5 TRUE 1.46e12
str(weight_Loginfo)
## spc_tbl_ [67 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num [1:67] 116 116 294 125 126 ...
## $ Fat : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
## $ LogId : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. Date = col_character(),
## .. WeightKg = col_double(),
## .. WeightPounds = col_double(),
## .. Fat = col_double(),
## .. BMI = col_double(),
## .. IsManualReport = col_logical(),
## .. LogId = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(weight_Loginfo)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
Examine the data, check for NA, and remove duplicates for our three main tables.
dim(sleep_daily)
## [1] 413 5
sum(is.na(sleep_daily))
## [1] 0
sum(duplicated(sleep_daily))
## [1] 3
sleep_daily <- sleep_daily[!duplicated(sleep_daily), ]
dim(daily_activity)
## [1] 940 15
sum(is.na(daily_activity))
## [1] 0
sum(duplicated(daily_activity))
## [1] 0
daily_activity <- daily_activity[!duplicated(daily_activity), ]
dim(weight_Loginfo)
## [1] 67 8
sum(is.na(weight_Loginfo))
## [1] 65
sum(duplicated(weight_Loginfo))
## [1] 0
weight_Loginfo <- weight_Loginfo[!duplicated(weight_Loginfo), ]
Removing the duplicates and NA
daily_activity <- daily_activity %>%
distinct() %>%
drop_na()
sleep_daily <- sleep_daily %>%
distinct() %>%
drop_na()
weight_Loginfo <- weight_Loginfo %>%
distinct() %>%
drop_na()
We will check our dataset again for duplicates and NA.
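A minimal sketch of such a re-check, run on a small toy tibble rather than the real files (`weight_toy` and its values are hypothetical). It also illustrates a point worth keeping in mind for the weight table: `drop_na()` with no arguments removes every row that has an NA in any column, so a sparsely filled column such as `Fat` (which shows NA values in the preview above) can remove most of the rows. Passing column names to `drop_na()` restricts the check to those columns.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the weight table (hypothetical values, not the real data)
weight_toy <- tibble(
  Id       = c(1, 1, 2, 3),
  WeightKg = c(52.6, 52.6, 133.5, 56.7),
  Fat      = c(22, NA, NA, NA)
)

# NA count per column shows where the missing values sit
na_per_column <- colSums(is.na(weight_toy))
print(na_per_column)

# Number of fully duplicated rows
n_duplicates <- sum(duplicated(weight_toy))
print(n_duplicates)

# drop_na() on all columns versus only on the columns the analysis needs
dropped_all <- weight_toy %>% distinct() %>% drop_na()
dropped_key <- weight_toy %>% distinct() %>% drop_na(Id, WeightKg)

print(nrow(dropped_all))
print(nrow(dropped_key))
```

On the toy data the unrestricted call keeps far fewer rows than the restricted one, so it is worth comparing `nrow()` before and after cleaning each real table.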
Convert ActivityDate into date format and add a column for day of the week:
daily_activity <- daily_activity %>% mutate( Weekday = weekdays(as.Date(ActivityDate, "%m/%d/%Y")))
Verify whether there are 30 users by using n_distinct(). The dataset contains information on 33 users’ daily activities, 24 users’ sleep, and just 8 users’ weight. Where there is a discrepancy, such as in the weight table, check how the data was recorded. Looking at how users entered the data may explain why values are missing.
weight_Loginfo %>%
  # IsManualReport is a logical column, so filter on it directly;
  # comparing it with the string "True" matches no rows
  filter(IsManualReport) %>%
  group_by(Id) %>%
  summarise("Manual Weight Report" = n())
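The user counts mentioned above can be obtained with `n_distinct()` on the `Id` column of each table. The sketch below uses toy tibbles (`activity_toy`, `sleep_toy`, `weight_toy`, all hypothetical) so that it runs on its own; with the real data the same call would be applied to `daily_activity$Id`, `sleep_daily$Id`, and `weight_Loginfo$Id`.

```r
library(dplyr)

# Toy stand-ins for the three tables (hypothetical values, not the real data)
activity_toy <- tibble(Id = c(1, 1, 2, 3), TotalSteps = c(100, 200, 300, 400))
sleep_toy    <- tibble(Id = c(1, 2, 2), TotalMinutesAsleep = c(400, 420, 390))
weight_toy   <- tibble(Id = c(1), WeightKg = c(60))

# Count the distinct users present in each table
user_counts <- c(
  activity = n_distinct(activity_toy$Id),
  sleep    = n_distinct(sleep_toy$Id),
  weight   = n_distinct(weight_toy$Id)
)
print(user_counts)
```

Comparing the three counts side by side makes it easy to see which tables cover only a subset of users.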
merged_data <- merge(daily_activity, sleep_daily, by = "Id", all = TRUE)
merged_data <- merge(merged_data, weight_Loginfo, by = "Id", all = TRUE)
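Merging on `Id` alone pairs every activity row of a user with every sleep and weight row of the same user, which is likely why the merged table grows to thousands of rows. The sketch below, on toy data with hypothetical values, contrasts that with a join on both `Id` and a parsed date column. The helper column name `Date` is an assumption introduced here, not part of the original tables.

```r
library(dplyr)

# Toy stand-ins (hypothetical values, not the real data)
activity_toy <- tibble(
  Id           = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps   = c(13162, 10735, 5000)
)
sleep_toy <- tibble(
  Id                 = c(1, 1),
  SleepDay           = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384)
)

# Parse both date strings into a common Date column; as.Date() reads only
# the part matched by the format and ignores the trailing time of day
activity_toy <- activity_toy %>%
  mutate(Date = as.Date(ActivityDate, format = "%m/%d/%Y"))
sleep_toy <- sleep_toy %>%
  mutate(Date = as.Date(SleepDay, format = "%m/%d/%Y"))

# Joining on Id only: each activity day is paired with every sleep day
by_id <- merge(activity_toy, sleep_toy, by = "Id", all = TRUE)

# Joining on Id and Date: one row per user per day
by_id_date <- full_join(activity_toy, sleep_toy, by = c("Id", "Date"))

print(nrow(by_id))
print(nrow(by_id_date))
```

With a per-day join, summary statistics are computed over user-days rather than over repeated combinations of rows, so heavily logged users are not over-weighted.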
Summarizing the merged dataset using summary().
merged_data %>%
dplyr::select(Weekday,
TotalSteps,
TotalDistance,
VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes,
SedentaryMinutes,
Calories,
TotalMinutesAsleep,
TotalTimeInBed,
WeightPounds,
BMI) %>%
summary()
## Weekday TotalSteps TotalDistance VeryActiveMinutes
## Length:12575 Min. : 0 Min. : 0.000 Min. : 0.00
## Class :character 1st Qu.: 4676 1st Qu.: 3.180 1st Qu.: 0.00
## Mode :character Median : 8580 Median : 6.110 Median : 8.00
## Mean : 8115 Mean : 5.733 Mean : 23.89
## 3rd Qu.:11207 3rd Qu.: 7.920 3rd Qu.: 36.00
## Max. :36019 Max. :28.030 Max. :210.00
##
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 0.00 1st Qu.:144.0 1st Qu.: 660.0 1st Qu.:1776
## Median : 10.00 Median :201.0 Median : 738.0 Median :2158
## Mean : 17.22 Mean :200.2 Mean : 806.2 Mean :2323
## 3rd Qu.: 24.00 3rd Qu.:258.0 3rd Qu.: 878.0 3rd Qu.:2859
## Max. :143.00 Max. :518.0 Max. :1440.0 Max. :4900
##
## TotalMinutesAsleep TotalTimeInBed WeightPounds BMI
## Min. : 58.0 Min. : 61.0 Min. :116.0 Min. :22.65
## 1st Qu.:361.0 1st Qu.:402.0 1st Qu.:116.0 1st Qu.:22.65
## Median :432.0 Median :462.0 Median :159.6 Median :27.45
## Mean :419.1 Mean :458.2 Mean :138.2 Mean :25.10
## 3rd Qu.:492.0 3rd Qu.:526.0 3rd Qu.:159.6 3rd Qu.:27.45
## Max. :796.0 Max. :961.0 Max. :159.6 Max. :27.45
## NA's :227 NA's :227 NA's :10994 NA's :10994
Percentage of minutes spent very active, fairly active, lightly active, or sedentary. According to the pie chart, users spend 81.3% of their tracked minutes sedentary and just 1.74% very active.
# Total minutes recorded at each activity level across all users and days
sedentary_minutes <- sum(daily_activity$SedentaryMinutes)
lightly_minutes <- sum(daily_activity$LightlyActiveMinutes)
fairly_minutes <- sum(daily_activity$FairlyActiveMinutes)
active_minutes <- sum(daily_activity$VeryActiveMinutes)
total_minutes <- sedentary_minutes + lightly_minutes + fairly_minutes + active_minutes
# Share of each activity level as a percentage of all tracked minutes
sedentary_percentage <- (sedentary_minutes / total_minutes) * 100
lightly_percentage <- (lightly_minutes / total_minutes) * 100
fairly_percentage <- (fairly_minutes / total_minutes) * 100
active_percentage <- (active_minutes / total_minutes) * 100
percentage <- data.frame(level=c("Sedentary", "Lightly", "Fairly", "Active"), minutes=c(sedentary_percentage, lightly_percentage, fairly_percentage, active_percentage))
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(percentage, labels = ~level, values = ~minutes, type = 'pie',textposition = 'outside',textinfo = 'label+percent') %>%
layout(title = 'Activity Level Minutes',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
The bar graph indicates a noticeable increase in physical activity on Saturdays: users spent less time sedentary and took more steps than on other days of the week. One plausible explanation is that Saturday falls on the weekend, when people have more free time and more opportunities for physical activities they enjoy. The data suggests a pattern of lower activity on weekdays and higher activity on weekends, which is common among people with busy weekday schedules. Overall, the graph underlines the value of building regular physical activity into the daily routine.
Less Sedentary Minutes
More steps on Tuesday
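The weekday comparison described above can be reproduced by averaging steps and sedentary minutes per day of the week. The following is a minimal sketch on toy data (values are hypothetical); it derives the weekday from the numeric day index instead of `weekdays()`, an assumption made here so that the result does not depend on the system locale.

```r
library(dplyr)

# Toy stand-in for daily_activity (hypothetical values, not the real data)
activity_toy <- tibble(
  ActivityDate     = c("4/12/2016", "4/16/2016", "4/19/2016", "4/23/2016"),
  TotalSteps       = c(9000, 12000, 11000, 14000),
  SedentaryMinutes = c(800, 700, 900, 650)
)

# English day names indexed by POSIXlt wday (0 = Sunday, ..., 6 = Saturday)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

by_weekday <- activity_toy %>%
  mutate(
    Date    = as.Date(ActivityDate, format = "%m/%d/%Y"),
    Weekday = factor(day_names[as.POSIXlt(Date)$wday + 1], levels = day_names)
  ) %>%
  group_by(Weekday) %>%
  summarise(
    mean_steps     = mean(TotalSteps),
    mean_sedentary = mean(SedentaryMinutes),
    .groups = "drop"
  )

print(by_weekday)
```

The resulting table has one row per weekday and can be passed to `ggplot()` with `geom_col()` to draw the bar charts; storing `Weekday` as an ordered factor keeps the bars in calendar order.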
ggplot(data=daily_activity,aes(x=TotalSteps,y=SedentaryMinutes, color=Calories)) +
geom_point(size=3) +
geom_smooth(method="lm",color="blue") +
labs(title="Total Steps vs. Sedentary Minutes",x="Total Steps",y="Sedentary Minutes")+
scale_color_gradient(low="#ffdca7",high="#422d9e")
## `geom_smooth()` using formula = 'y ~ x'
The graph shows a scatter plot of the relationship between the total number of steps taken by users and the amount of time spent in sedentary behavior. The color of each point represents the number of calories burned by the user.
From the plot, we can see that there is a negative correlation between the number of sedentary minutes and the total number of steps taken. In other words, as the amount of time spent being sedentary increases, the number of steps taken decreases.
The downward-sloping linear regression line is consistent with a negative relationship between sedentary behavior and the total number of steps taken, although the plot alone does not establish statistical significance. The plot also indicates that users who burn more calories tend to take more steps, but this relationship is not as strong as the relationship between sedentary behavior and steps taken.
Overall, the plot suggests that less sedentary time is associated with a higher daily step count and, in turn, higher calorie expenditure. Because the data is observational, this is an association rather than proof of cause and effect.
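Whether the negative relationship is statistically significant can be checked with a correlation test rather than read off the plot. The sketch below applies base R's `cor.test()` to toy vectors (hypothetical values); with the real data the inputs would be `daily_activity$TotalSteps` and `daily_activity$SedentaryMinutes`.

```r
# Toy stand-ins for the two columns (hypothetical values, not the real data)
steps     <- c(2000, 4000, 6000, 8000, 10000, 12000)
sedentary <- c(1100, 1000, 950, 800, 760, 700)

# Pearson correlation with a test of the null hypothesis "correlation is zero"
result <- cor.test(steps, sedentary, method = "pearson")

# Correlation coefficient (sign gives the direction of the relationship)
print(result$estimate)
# p-value of the test
print(result$p.value)
```

A negative estimate together with a small p-value would support the claim of a significant negative relationship; the size of the estimate indicates how strong it is.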
There could be several reasons for the high sedentary time and low step count among some of the users in the data. Some possible reasons could be:
Job or lifestyle: The users could have a job or lifestyle that involves a lot of sitting or being sedentary, which could explain their high sedentary time.
Health conditions: Some of the users may have health conditions that make it difficult for them to be physically active or mobile, leading to a sedentary lifestyle.
Personal choice: Some users may choose to be sedentary for personal reasons, such as lack of motivation, leisure activities that do not involve physical activity, or preference for a more relaxed lifestyle.
Environmental factors: The users could be living in an environment that is not conducive to physical activity, such as lack of safe walking or exercise spaces, or a climate that discourages outdoor activity.
Lack of awareness or education: Some users may not be aware of the health benefits of physical activity, or may not have access to education or resources that promote an active lifestyle.
Examining the connection between the number of steps taken and the amount of calories burned.
mean_steps <- mean(daily_activity$TotalSteps)
mean_steps
## [1] 7637.911
mean_calories <- mean(daily_activity$Calories)
mean_calories
## [1] 2303.61
ggplot(data=daily_activity, aes(x=TotalSteps,y=Calories,color=Calories)) +
geom_point() +
labs(title="Calories burned for every step taken",x="Total Steps Taken",y="Calories Burned") +
geom_smooth(method="lm") +
geom_hline(mapping = aes(yintercept=mean_calories),color="yellow",lwd=1.0)+
geom_vline(mapping = aes(xintercept=mean_steps),color="red",lwd=1.0) +
annotate("text", x=10000, y=500, label="Average Steps", angle=-90) + # annotate() draws the label once; angle (not srt) rotates text in ggplot2
annotate("text", x=29000, y=2500, label="Average Calories") +
scale_color_gradient(low="#ffdca7",high="#422d9e")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Overall, this graph helps to visualize the relationship between steps and calories burned. The regression line shows a positive correlation between the two variables, indicating that the more steps taken, the more calories burned. The reference lines help to highlight the mean values of both variables, giving a reference point to compare individual data points. The text labels add more information and context to the graph. The color gradient also adds more depth to the data by showing the range of calories burned, which is not immediately apparent from the scatter plot alone.
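The positive relationship shown by the regression line can also be quantified by fitting the same linear model that `geom_smooth(method="lm")` draws. A minimal sketch on toy data (hypothetical values) follows; with the real data the `data` argument would be `daily_activity`.

```r
# Toy stand-in for daily_activity (hypothetical values, not the real data)
activity_toy <- data.frame(
  TotalSteps = c(2000, 5000, 8000, 11000, 14000),
  Calories   = c(1700, 2000, 2250, 2600, 2900)
)

# Linear model: calories as a function of steps
fit <- lm(Calories ~ TotalSteps, data = activity_toy)

# Slope = estimated extra calories per additional step
slope <- coef(fit)[["TotalSteps"]]
# Pearson correlation between the two variables
r <- cor(activity_toy$TotalSteps, activity_toy$Calories)

print(slope)
print(r)
```

The slope gives a concrete figure for how many additional calories are associated with each extra step, which is easier to communicate than the plot alone.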
Based on the graphs, here are some recommendations for Bellabeat:
Increase focus on active minutes: The pie chart shows that very active minutes account for only 1.74% of users’ tracked time, while 81.3% is sedentary. Encouraging users to track and increase their active minutes could be a valuable way to differentiate Bellabeat from other fitness apps and products.
Provide more personalized feedback: The line chart shows that users who engage with the app for a longer period of time tend to increase their daily step count. This suggests that providing personalized feedback and encouragement over time could be an effective way to keep users engaged and motivated.
Highlight the connection between steps and calories burned: The scatter plot shows a positive correlation between daily steps taken and calories burned. Emphasizing this connection could help users understand the impact of their activity level on their overall health and fitness.
Consider adding more sleep tracking features: The merged data table shows that users who reported higher levels of sleep quality also tended to take more steps and burn more calories. This suggests that adding more sleep tracking features to the Bellabeat app or products could be a valuable way to enhance user engagement and satisfaction.
Focus on increasing the number of active minutes per day: The data suggests that users who engage in more physical activity tend to burn more calories and achieve their weight goals. Therefore, it is recommended that Bellabeat focus on developing products that encourage users to engage in physical activity and increase their active minutes per day.
Enhance the sleep tracking features: The data shows that sleep is an important factor in achieving fitness goals. Bellabeat can improve its sleep tracking features to provide users with more detailed and accurate information about their sleep patterns. This could include tracking sleep stages, detecting snoring or sleep apnea, and providing personalized recommendations for improving sleep quality.
Personalize the app for individual users: The data reveals that there is a lot of variation in users’ activity levels, sleep patterns, and weight goals. Bellabeat can enhance the app’s personalization features to provide users with more customized recommendations based on their unique goals and preferences. This could include personalized workout plans, sleep recommendations, and dietary advice.
Expand the product line to include more wearable devices: The data shows that wearable devices are popular among users and are effective in tracking activity and sleep. Bellabeat can expand its product line to include more wearable devices that offer a wider range of features and cater to different user preferences and budgets.
Partner with health and wellness experts: Bellabeat can partner with health and wellness experts to provide users with expert advice and recommendations. This could include partnering with personal trainers, nutritionists, and sleep experts to offer personalized advice and support to users.
Overall, these recommendations focus on ways to differentiate the Bellabeat app and products from other fitness offerings by providing more personalized feedback, emphasizing the connection between activity level and health outcomes, and adding more features to enhance user engagement and satisfaction.