This is a capstone project completed as part of my Google Data Analytics Professional Certificate course. For the analysis I will be using the R programming language and the RStudio IDE for their easy statistical analysis tools and data visualizations.
For this project, the following data analysis steps will be followed:
Ask - Prepare - Process - Analyze - Share - Act
Each step includes code where needed, key tasks as a checklist, and deliverables as a checklist.
You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.
Bellabeat is a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.
Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions will guide your analysis:
The main objective is to build a marketing strategy by analyzing smart device usage data to derive meaningful insights into how consumers use non-Bellabeat smart devices.
Urška Sršen and Sando Mur, Bellabeat marketing analytics team
A clear statement of the business task
Identify how consumers use smart devices
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(geosphere)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, were retired in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(knitr)
Now, let’s prepare data for exploration
For this project, I will use the FitBit Fitness Tracker Data.
activity <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
intensities <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
calories <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
sleep <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weight <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
To verify that the datasets were imported correctly, I used the kable() and head() functions.
kable(head(activity, 10))
| Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 |
| 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 |
| 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 |
| 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 |
| 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0 | 2.71 | 0.41 | 5.04 | 0 | 36 | 10 | 221 | 773 | 1863 |
| 1503960366 | 4/17/2016 | 9705 | 6.48 | 6.48 | 0 | 3.19 | 0.78 | 2.51 | 0 | 38 | 20 | 164 | 539 | 1728 |
| 1503960366 | 4/18/2016 | 13019 | 8.59 | 8.59 | 0 | 3.25 | 0.64 | 4.71 | 0 | 42 | 16 | 233 | 1149 | 1921 |
| 1503960366 | 4/19/2016 | 15506 | 9.88 | 9.88 | 0 | 3.53 | 1.32 | 5.03 | 0 | 50 | 31 | 264 | 775 | 2035 |
| 1503960366 | 4/20/2016 | 10544 | 6.68 | 6.68 | 0 | 1.96 | 0.48 | 4.24 | 0 | 28 | 12 | 205 | 818 | 1786 |
| 1503960366 | 4/21/2016 | 9819 | 6.34 | 6.34 | 0 | 1.34 | 0.35 | 4.65 | 0 | 19 | 8 | 211 | 838 | 1775 |
Cleaning data for analysis or manipulation of data
The dates were formatted inconsistently across the datasets, so I need to fix that.
# intensities
intensities$ActivityHour=as.POSIXct(intensities$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
intensities$time <- format(intensities$ActivityHour, format = "%H:%M:%S")
intensities$date <- format(intensities$ActivityHour, format = "%m/%d/%y")
# calories
calories$ActivityHour=as.POSIXct(calories$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
calories$time <- format(calories$ActivityHour, format = "%H:%M:%S")
calories$date <- format(calories$ActivityHour, format = "%m/%d/%y")
# activity
activity$ActivityDate=as.POSIXct(activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
activity$date <- format(activity$ActivityDate, format = "%m/%d/%y")
# sleep
sleep$SleepDay=as.POSIXct(sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleep$date <- format(sleep$SleepDay, format = "%m/%d/%y")
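The conversion above can be checked on a few sample values before it is applied to a whole column. The following is a minimal sketch, assuming timestamps laid out like those in the hourly files; the sample strings and the names `raw`, `parsed`, `n_failed`, and `hourly` are made up for illustration, and a fixed `tz = "UTC"` is used here instead of `Sys.timezone()` only to keep the result reproducible. Note that `%p` is matched according to the current locale, so parsing AM/PM may behave differently outside an English locale.

```r
# Toy timestamps in the same layout as the hourly files (values are made up).
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM", "4/13/2016 11:00:00 PM")

# %I is the 12-hour clock and %p the AM/PM marker; a fixed time zone keeps
# the result the same on every machine.
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Strings that do not match the format become NA, so count the failures.
n_failed <- sum(is.na(parsed))

# Split the parsed timestamp into the same time and date columns used above.
hourly <- data.frame(
  ActivityHour = parsed,
  time = format(parsed, format = "%H:%M:%S"),
  date = format(parsed, format = "%m/%d/%y")
)
print(hourly)
```

If `n_failed` is greater than zero, the format string probably does not match the raw column, which is worth checking before any grouping or merging on the derived columns.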
Now that the date formats are consistent across the datasets, it’s time for exploration.
Since the data has been prepared and formatted, it’s time for analysis.
n_distinct(activity$Id)
## [1] 33
n_distinct(intensities$Id)
## [1] 33
n_distinct(calories$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
This gives the number of participants in each dataset: activity, intensities, and calories each have 33 participants, while sleep has 24 and weight has 8. The weight dataset has too few participants to support recommendations or conclusions, as its sample is very small compared with the others.
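The participant counts can also be collected in one step instead of one call per dataset. This is a minimal sketch on made-up data frames, assuming only that every dataset has an `Id` column; the list name `datasets` and its toy contents are hypothetical stand-ins for the imported files.

```r
library(dplyr)

# Toy stand-ins for the imported datasets (Ids are made up).
datasets <- list(
  activity = data.frame(Id = c(1, 1, 2, 3)),
  sleep    = data.frame(Id = c(1, 2, 2)),
  weight   = data.frame(Id = c(3))
)

# Count distinct participants in each dataset; the result is a named vector.
participants <- sapply(datasets, function(df) n_distinct(df$Id))
print(participants)
```

Keeping the counts in one named vector makes it easier to see at a glance which datasets cover only part of the sample.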
Let’s have an understanding of the data through summary.
# activity
activity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes,
Calories) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes Calories
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.:1828
## Median : 7406 Median : 5.245 Median :1057.5 Median :2134
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean :2304
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :4900
# explore num of active minutes per category
activity %>%
select(VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes) %>%
summary()
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
## Median : 4.00 Median : 6.00 Median :199.0
## Mean : 21.16 Mean : 13.56 Mean :192.8
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
## Max. :210.00 Max. :143.00 Max. :518.0
# explore num of active distance per category
activity %>%
select(VeryActiveDistance,
ModeratelyActiveDistance,
LightActiveDistance) %>%
summary()
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 1.945
## Median : 0.210 Median :0.2400 Median : 3.365
## Mean : 1.503 Mean :0.5675 Mean : 3.341
## 3rd Qu.: 2.053 3rd Qu.:0.8000 3rd Qu.: 4.782
## Max. :21.920 Max. :6.4800 Max. :10.710
# calories
calories %>%
select(Calories) %>%
summary()
## Calories
## Min. : 42.00
## 1st Qu.: 63.00
## Median : 83.00
## Mean : 97.39
## 3rd Qu.:108.00
## Max. :948.00
# sleep
sleep %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
# weight
weight %>%
select(WeightKg,
BMI) %>%
summary()
## WeightKg BMI
## Min. : 52.60 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:23.96
## Median : 62.50 Median :24.39
## Mean : 72.04 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:25.56
## Max. :133.50 Max. :47.54
Insights from the summary of the data
To further understand how physical activity and sleep patterns might be related to or influence each other for each participant, I will merge the activity and sleep data using the Id and date columns.
merged_data <- merge(activity, sleep, by = c('Id', 'date'))
kable(head(merged_data, 10))
| Id | date | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 04/12/16 | 2016-04-12 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 | 2016-04-12 | 1 | 327 | 346 |
| 1503960366 | 04/13/16 | 2016-04-13 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 | 2016-04-13 | 2 | 384 | 407 |
| 1503960366 | 04/15/16 | 2016-04-15 | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 | 2016-04-15 | 1 | 412 | 442 |
| 1503960366 | 04/16/16 | 2016-04-16 | 12669 | 8.16 | 8.16 | 0 | 2.71 | 0.41 | 5.04 | 0 | 36 | 10 | 221 | 773 | 1863 | 2016-04-16 | 2 | 340 | 367 |
| 1503960366 | 04/17/16 | 2016-04-17 | 9705 | 6.48 | 6.48 | 0 | 3.19 | 0.78 | 2.51 | 0 | 38 | 20 | 164 | 539 | 1728 | 2016-04-17 | 1 | 700 | 712 |
| 1503960366 | 04/19/16 | 2016-04-19 | 15506 | 9.88 | 9.88 | 0 | 3.53 | 1.32 | 5.03 | 0 | 50 | 31 | 264 | 775 | 2035 | 2016-04-19 | 1 | 304 | 320 |
| 1503960366 | 04/20/16 | 2016-04-20 | 10544 | 6.68 | 6.68 | 0 | 1.96 | 0.48 | 4.24 | 0 | 28 | 12 | 205 | 818 | 1786 | 2016-04-20 | 1 | 360 | 377 |
| 1503960366 | 04/21/16 | 2016-04-21 | 9819 | 6.34 | 6.34 | 0 | 1.34 | 0.35 | 4.65 | 0 | 19 | 8 | 211 | 838 | 1775 | 2016-04-21 | 1 | 325 | 364 |
| 1503960366 | 04/23/16 | 2016-04-23 | 14371 | 9.04 | 9.04 | 0 | 2.81 | 0.87 | 5.36 | 0 | 41 | 21 | 262 | 732 | 1949 | 2016-04-23 | 1 | 361 | 384 |
| 1503960366 | 04/24/16 | 2016-04-24 | 10039 | 6.41 | 6.41 | 0 | 2.92 | 0.21 | 3.28 | 0 | 39 | 5 | 238 | 709 | 1788 | 2016-04-24 | 1 | 430 | 449 |
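Two properties of this kind of merge are worth checking: repeated rows in either input are carried into the result, and `merge()` keeps only the keys present in both inputs by default. The sketch below illustrates both on made-up data; the data frames `activity_toy` and `sleep_toy` and their values are hypothetical, and whether the real files contain duplicates is something to verify rather than assume.

```r
# Toy inputs (values are made up); sleep_toy contains one repeated row.
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  date = c("04/12/16", "04/13/16", "04/12/16"),
  TotalSteps = c(1000, 2000, 3000)
)
sleep_toy <- data.frame(
  Id = c(1, 1, 2),
  date = c("04/12/16", "04/12/16", "04/12/16"),
  TotalMinutesAsleep = c(400, 400, 350)
)

# Count exact duplicate rows, then drop them before merging.
n_dupes <- sum(duplicated(sleep_toy))
sleep_unique <- unique(sleep_toy)

# Default merge is an inner join: the activity row for Id 1 on 04/13/16
# has no sleep record and is dropped.
merged_inner <- merge(activity_toy, sleep_unique, by = c("Id", "date"))

# all.x = TRUE keeps every activity row and fills missing sleep values with NA.
merged_left <- merge(activity_toy, sleep_unique, by = c("Id", "date"), all.x = TRUE)

print(merged_inner)
print(merged_left)
```

The inner join suits a comparison of activity and sleep on the same day, since only days with both records are meaningful for that question.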
ggplot(data = activity, aes(x = TotalSteps, y= Calories)) + geom_point() +
geom_smooth() + labs(title = "Total Steps vs Calories")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data = sleep) + geom_point(mapping = aes(x = TotalMinutesAsleep, y= TotalTimeInBed)) +
labs(title = "Total Minutes Asleep vs Total Time In Bed")
intensities_new <- intensities %>%
group_by(time) %>%
drop_na() %>%
summarise(mean_TotalIntensities_new = mean(TotalIntensity))
# geom_col() draws bars from precomputed values, so no binning parameters are involved
ggplot(data = intensities_new, aes(x = time, y = mean_TotalIntensities_new)) + geom_col(fill = "skyblue", color = "black") +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Average Total Intensity vs Time")
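The hourly averages behind this chart can also be sorted to read off the busiest hours directly. This is a minimal sketch on made-up values, assuming the same `time` and `TotalIntensity` columns as above; `toy_intensities`, `hourly_mean`, and `peak_hour` are hypothetical names.

```r
library(dplyr)

# Toy hourly intensity records (values are made up).
toy_intensities <- data.frame(
  time = c("08:00:00", "08:00:00", "18:00:00", "18:00:00", "23:00:00"),
  TotalIntensity = c(10, 14, 30, 26, 2)
)

# Average intensity per hour of day, highest first.
hourly_mean <- toy_intensities %>%
  group_by(time) %>%
  summarise(mean_intensity = mean(TotalIntensity)) %>%
  arrange(desc(mean_intensity))

# The first row after sorting is the most active hour.
peak_hour <- hourly_mean$time[1]
print(hourly_mean)
```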
ggplot(data = merged_data, aes(x = TotalMinutesAsleep, y = SedentaryMinutes)) +
geom_point(fill = "skyblue", color = "darkblue")+
geom_smooth() + labs(title = "Total Minutes Asleep vs Sedentary Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
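The trend in a scatter plot like the one above can be summarized with a correlation coefficient and a fitted slope. The following is a minimal sketch on made-up values, not the study data; `toy`, `r`, and `slope` are hypothetical names, and a correlation describes an association only, not a cause.

```r
# Toy daily records (values are made up) with the same column names as merged_data.
toy <- data.frame(
  TotalMinutesAsleep = c(300, 360, 420, 480, 540),
  SedentaryMinutes   = c(900, 850, 760, 700, 640)
)

# Pearson correlation: the sign gives the direction, the magnitude the strength.
r <- cor(toy$TotalMinutesAsleep, toy$SedentaryMinutes)

# Slope of a simple linear fit: change in sedentary minutes per extra minute asleep.
fit <- lm(SedentaryMinutes ~ TotalMinutesAsleep, data = toy)
slope <- coef(fit)[["TotalMinutesAsleep"]]

print(r)
print(slope)
```

On the real data the same two calls would use `merged_data` in place of `toy`.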
Participants’ average total steps per day (7638) may be insufficient for substantial health benefits according to CDC research. Bellabeat should encourage increased physical activity among its users.
The significant sedentary time (average 991 minutes) indicates a need for reducing sedentary behavior, which could contribute to overall health improvement. Bellabeat can develop features to remind users to stand, move, or take short breaks.
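Most tracked active time is light activity: on average 192.8 lightly active minutes per day, compared with 21.16 very active and 13.56 fairly active minutes. Bellabeat should motivate users to increase their activity levels and set achievable activity goals.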
The relationship between total steps and calories burnt highlights the importance of regular physical activity. Bellabeat can emphasize this relationship in its marketing strategy to encourage more activity.
Total minutes asleep and total time in bed are linearly related, as expected, but users spend about 39 more minutes in bed than asleep on average (458.6 vs 419.5 minutes). Bellabeat could create features or reminders to help users fall asleep sooner and improve their sleep routines.
Users are most active between 5 am and 10 pm. Bellabeat can use this information to encourage early morning or late evening workouts.
The negative relationship between sedentary minutes and total minutes asleep shows that users with more sedentary time tend to sleep less. This is an association rather than proof of cause, but it gives Bellabeat an opportunity to educate users about how prolonged sitting may affect sleep.
By leveraging these insights, Bellabeat can enhance its product features, marketing strategies, and user engagement to improve overall customer satisfaction and drive growth in the smart device market.