Capstone Project: Case Study 2

Introduction

This capstone project is part of my Google Data Analytics Professional Certificate course. For the analysis I will use the R programming language and the RStudio IDE for their accessible statistical analysis and data visualization tools.

This project follows the six steps of the data analysis process:

Ask - Prepare - Process - Analyze - Share - Act

Each step of the process follows the Case Study Roadmap:

Code, when needed for the step
Key tasks, as a checklist
Deliverables, as a checklist

Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

About the company

Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

Ask

Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions will guide your analysis:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Key tasks

Identify clear business task

The main objective is to inform Bellabeat's marketing strategy by analyzing smart device usage data for insights into how consumers use non-Bellabeat smart devices.

Consider key stakeholders

Urška Sršen, Sando Mur, and the Bellabeat marketing analytics team

Deliverable

A clear statement of the business task
Identify how consumers use smart devices.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(ggplot2)
library(geosphere)

## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, were retired in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(dplyr)
library(ggmap)

## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.

library(knitr)

Prepare

Now, let's prepare the data for exploration.

Key tasks

Check the data for errors.

Importing Dataset

For this project, I will use the FitBit Fitness Tracker Data.

activity <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
intensities <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
calories <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
sleep <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weight <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
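
The five `read.csv()` calls above repeat the same absolute path. A minimal sketch of one way to factor that out is shown here. `data_dir` and `read_fitbit()` are hypothetical names, and the temporary folder with a two-row CSV only stands in for the real Fitabase folder so the snippet runs on its own.

```r
# Hypothetical helper: build the file path from a base folder and a table name.
read_fitbit <- function(data_dir, table_name) {
  read.csv(file.path(data_dir, paste0(table_name, "_merged.csv")))
}

# Stand-in for the real data folder (assumption: for illustration only).
data_dir <- tempdir()
write.csv(
  data.frame(Id = c(1503960366, 1503960366),
             ActivityDate = c("4/12/2016", "4/13/2016"),
             TotalSteps = c(13162, 10735)),
  file.path(data_dir, "dailyActivity_merged.csv"),
  row.names = FALSE
)

activity_demo <- read_fitbit(data_dir, "dailyActivity")
str(activity_demo)
```

With the real data, `data_dir` would point at the Fitabase folder and each table would need only its short name.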

To verify that the datasets were imported correctly, I used the kable() and head() functions.

kable(head(activity, 10))

Id	ActivityDate	TotalSteps	TotalDistance	TrackerDistance	VeryActiveDistance	ModeratelyActiveDistance	LightActiveDistance	VeryActiveMinutes	FairlyActiveMinutes	LightlyActiveMinutes	SedentaryMinutes	Calories
1503960366	4/12/2016	13162	8.50	8.50	1.88	0.55	6.06	25	13	328	728	1985
1503960366	4/13/2016	10735	6.97	6.97	1.57	0.69	4.71	21	19	217	776	1797
1503960366	4/14/2016	10460	6.74	6.74	2.44	0.40	3.91	30	11	181	1218	1776
1503960366	4/15/2016	9762	6.28	6.28	2.14	1.26	2.83	29	34	209	726	1745
1503960366	4/16/2016	12669	8.16	8.16	2.71	0.41	5.04	36	10	221	773	1863
1503960366	4/17/2016	9705	6.48	6.48	3.19	0.78	2.51	38	20	164	539	1728
1503960366	4/18/2016	13019	8.59	8.59	3.25	0.64	4.71	42	16	233	1149	1921
1503960366	4/19/2016	15506	9.88	9.88	3.53	1.32	5.03	50	31	264	775	2035
1503960366	4/20/2016	10544	6.68	6.68	1.96	0.48	4.24	28	12	205	818	1786
1503960366	4/21/2016	9819	6.34	6.34	1.34	0.35	4.65	19	8	211	838	1775
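
The key task for this step is to check the data for errors, and the first rows alone cannot reveal missing values or repeated records. A minimal sketch of such a check follows, on a made-up three-row data frame (`demo` is not part of the Fitabase data). The same calls could be pointed at `activity`, `sleep`, and the other tables.

```r
# Toy data (assumption): one missing value and one fully duplicated row.
demo <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 327, NA)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(demo))

# Number of rows that exactly repeat an earlier row.
n_duplicates <- sum(duplicated(demo))

na_counts
n_duplicates

# Drop the repeated rows, keeping the first occurrence of each.
demo_clean <- demo[!duplicated(demo), ]
```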

Process

Cleaning and manipulating the data for analysis

Key tasks

Choose your tools.
Transform the data so you can work with it effectively.
Document the cleaning process.

Deliverable

Documentation of any cleaning or manipulation of data

The dates were formatted inconsistently across the datasets, so I need to fix them.

# intensities: parse the 12-hour timestamp, then split it into time and date strings
intensities$ActivityHour <- as.POSIXct(intensities$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
intensities$time <- format(intensities$ActivityHour, format = "%H:%M:%S")
intensities$date <- format(intensities$ActivityHour, format = "%m/%d/%y")
# calories: same timestamp format as intensities
calories$ActivityHour <- as.POSIXct(calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
calories$time <- format(calories$ActivityHour, format = "%H:%M:%S")
calories$date <- format(calories$ActivityHour, format = "%m/%d/%y")
# activity: date only, no time component
activity$ActivityDate <- as.POSIXct(activity$ActivityDate, format = "%m/%d/%Y", tz = Sys.timezone())
activity$date <- format(activity$ActivityDate, format = "%m/%d/%y")
# sleep: timestamp is parsed, only the date string is kept for merging
sleep$SleepDay <- as.POSIXct(sleep$SleepDay, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
sleep$date <- format(sleep$SleepDay, format = "%m/%d/%y")
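
To see what the conversion above does to a single value, the sketch below parses one timestamp in the 12-hour format and splits it into the `time` and `date` strings. The sample string is illustrative, and `tz = "UTC"` is used instead of `Sys.timezone()` only so the result does not depend on the machine.

```r
raw <- "4/12/2016 1:00:00 PM"

# %I is the hour on a 12-hour clock and %p is the AM/PM marker.
# Note: how %p is matched can depend on the session locale.
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

time_part <- format(parsed, format = "%H:%M:%S")  # 24-hour time string
date_part <- format(parsed, format = "%m/%d/%y")  # date string with two-digit year

time_part
date_part
```

Because `date` is stored as a plain string in the same `%m/%d/%y` layout in every table, it can later serve as a join key.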

Now that the date formats are consistent across the datasets, it's time for exploration.

Analyze

Since the data has been prepared and formatted, it's time for analysis.

Key tasks

Perform calculations.
Identify trends and relationships

Exploring and summarizing data

n_distinct(activity$Id)

## [1] 33

n_distinct(intensities$Id)

## [1] 33

n_distinct(calories$Id)

## [1] 33

n_distinct(sleep$Id)

## [1] 24

n_distinct(weight$Id)

## [1] 8

This gives the number of participants in each dataset. The activity, intensities, and calories datasets each have 33 participants, while the sleep and weight datasets have 24 and 8 respectively. With only 8 participants, the weight dataset is too small to support recommendations or conclusions.
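
To put the counts in proportion, the sketch below divides each one by the 33 participants in the activity table. The numbers are the `n_distinct()` results reported above; `counts` and `coverage` are illustrative names.

```r
# Participant counts copied from the n_distinct() output above.
counts <- c(activity = 33, intensities = 33, calories = 33, sleep = 24, weight = 8)

# Share of the activity participants that each table covers.
coverage <- counts / counts[["activity"]]
round(coverage, 2)
```

On these counts, the sleep table covers about 73% of participants and the weight table about 24%.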

Let’s have an understanding of the data through summary.

# activity
activity %>%
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes,
         Calories) %>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900

# explore num of active minutes per category
activity %>%
  select(VeryActiveMinutes,
    FairlyActiveMinutes,
    LightlyActiveMinutes) %>%
  summary()

##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0       
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0       
##  Median :  4.00    Median :  6.00      Median :199.0       
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8       
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0       
##  Max.   :210.00    Max.   :143.00      Max.   :518.0

# explore num of active distance per category
activity %>%
  select(VeryActiveDistance,
         ModeratelyActiveDistance,
         LightActiveDistance) %>%
  summary()

##  VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
##  Min.   : 0.000     Min.   :0.0000           Min.   : 0.000     
##  1st Qu.: 0.000     1st Qu.:0.0000           1st Qu.: 1.945     
##  Median : 0.210     Median :0.2400           Median : 3.365     
##  Mean   : 1.503     Mean   :0.5675           Mean   : 3.341     
##  3rd Qu.: 2.053     3rd Qu.:0.8000           3rd Qu.: 4.782     
##  Max.   :21.920     Max.   :6.4800           Max.   :10.710

# calories
calories %>%
  select(Calories) %>%
  summary()

##     Calories     
##  Min.   : 42.00  
##  1st Qu.: 63.00  
##  Median : 83.00  
##  Mean   : 97.39  
##  3rd Qu.:108.00  
##  Max.   :948.00

# sleep
sleep %>%
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

# weight
weight %>%
  select(WeightKg,
         BMI) %>%
  summary()

##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

Insights from the summary of the data

Participants average 7,638 steps per day, which is somewhat below the level associated with health benefits according to CDC research.
The average sedentary time is 991 minutes (about 16.5 hours) per day, which is high and is consistent with the modest daily step counts. It should be reduced.
Most participants are lightly active.
Participants typically sleep once a day, for about 7 hours on average.
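
The figures behind these insights can be rechecked with a little arithmetic on the means reported in the summaries above. This is only a sketch: the variable names are illustrative, and averages of daily totals hide differences between participants.

```r
# Means copied from the summary() output above.
mean_sedentary_min <- 991.2
mean_asleep_min    <- 419.5
mean_in_bed_min    <- 458.6
active_min <- c(very = 21.16, fairly = 13.56, lightly = 192.8)

sedentary_hours <- mean_sedentary_min / 60         # hours per day
sedentary_share <- mean_sedentary_min / 1440       # fraction of a 24-hour day
asleep_hours    <- mean_asleep_min / 60            # hours asleep per sleep record
awake_in_bed    <- mean_in_bed_min - mean_asleep_min  # minutes in bed but awake
lightly_share   <- active_min[["lightly"]] / sum(active_min)

round(c(sedentary_hours = sedentary_hours,
        sedentary_share = sedentary_share,
        asleep_hours    = asleep_hours,
        awake_in_bed    = awake_in_bed,
        lightly_share   = lightly_share), 2)
```

On these means, participants are sedentary for roughly 16.5 hours a day, sleep for about 7 hours, and log about 85% of their active minutes as light activity.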

Merging of Data

To better understand how physical activity and sleep patterns might relate to or influence each other for each participant, I will merge the activity and sleep data on the Id and date columns.

merged_data <- merge(activity, sleep, by = c('Id', 'date'))
kable(head(merged_data, 10))

Id	date	ActivityDate	TotalSteps	TotalDistance	TrackerDistance	VeryActiveDistance	ModeratelyActiveDistance	LightActiveDistance	VeryActiveMinutes	FairlyActiveMinutes	LightlyActiveMinutes	SedentaryMinutes	Calories	SleepDay	TotalSleepRecords	TotalMinutesAsleep	TotalTimeInBed
1503960366	04/12/16	2016-04-12	13162	8.50	8.50	1.88	0.55	6.06	25	13	328	728	1985	2016-04-12	1	327	346
1503960366	04/13/16	2016-04-13	10735	6.97	6.97	1.57	0.69	4.71	21	19	217	776	1797	2016-04-13	2	384	407
1503960366	04/15/16	2016-04-15	9762	6.28	6.28	2.14	1.26	2.83	29	34	209	726	1745	2016-04-15	1	412	442
1503960366	04/16/16	2016-04-16	12669	8.16	8.16	2.71	0.41	5.04	36	10	221	773	1863	2016-04-16	2	340	367
1503960366	04/17/16	2016-04-17	9705	6.48	6.48	3.19	0.78	2.51	38	20	164	539	1728	2016-04-17	1	700	712
1503960366	04/19/16	2016-04-19	15506	9.88	9.88	3.53	1.32	5.03	50	31	264	775	2035	2016-04-19	1	304	320
1503960366	04/20/16	2016-04-20	10544	6.68	6.68	1.96	0.48	4.24	28	12	205	818	1786	2016-04-20	1	360	377
1503960366	04/21/16	2016-04-21	9819	6.34	6.34	1.34	0.35	4.65	19	8	211	838	1775	2016-04-21	1	325	364
1503960366	04/23/16	2016-04-23	14371	9.04	9.04	2.81	0.87	5.36	41	21	262	732	1949	2016-04-23	1	361	384
1503960366	04/24/16	2016-04-24	10039	6.41	6.41	2.92	0.21	3.28	39	5	238	709	1788	2016-04-24	1	430	449
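
By default `merge()` keeps only the `Id` and `date` pairs present in both tables, which is why the merged preview skips 04/14/16 and 04/18/16. The sketch below shows the effect on a few values copied from the previews above (`act` and `slp` are stand-ins, not the real tables). `all.x = TRUE` is the base R option for keeping every activity day.

```r
act <- data.frame(Id = 1503960366,
                  date = c("04/12/16", "04/13/16", "04/14/16"),
                  TotalSteps = c(13162, 10735, 10460))
slp <- data.frame(Id = 1503960366,
                  date = c("04/12/16", "04/13/16"),
                  TotalMinutesAsleep = c(327, 384))

# Inner join (the default): only days with both activity and sleep records.
inner <- merge(act, slp, by = c("Id", "date"))

# Left join: every activity day, with NA where no sleep record exists.
left <- merge(act, slp, by = c("Id", "date"), all.x = TRUE)

nrow(inner)
nrow(left)
is.na(left$TotalMinutesAsleep)
```

The inner join suits an analysis that needs both measurements on the same day, at the cost of dropping days without a sleep record.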

Visualization

ggplot(data = activity, aes(x = TotalSteps, y= Calories)) + geom_point() +
  geom_smooth() + labs(title = "Total Steps vs Calories")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is a positive relationship between total steps and calories, as the visualization shows. This tells us that the more active we are, the more calories we burn.
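
The strength of this relationship can be put into a number with a correlation coefficient. The sketch below uses only the ten `activity` rows previewed earlier, all from a single participant, so it illustrates the call rather than a result for the whole dataset.

```r
# TotalSteps and Calories from the ten previewed rows of `activity`.
steps <- c(13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 10544, 9819)
cals  <- c(1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775)

# Pearson correlation: values near 1 indicate a strong positive linear relationship.
r <- cor(steps, cals)
r
```

On the full table the equivalent call would be `cor(activity$TotalSteps, activity$Calories)`.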

ggplot(data = sleep) + geom_point(mapping = aes(x = TotalMinutesAsleep, y= TotalTimeInBed)) +
  labs(title = "Total Minutes Asleep vs Total Time In Bed")

There is a roughly linear relationship between Total Minutes Asleep and Total Time In Bed. Bellabeat users could therefore benefit from a bedtime reminder to help improve their sleep.
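
One way to turn the two sleep columns into a single measure is the share of time in bed that is spent asleep. The sketch below computes it for the ten merged rows previewed above. `sleep_efficiency` and `awake_in_bed` are illustrative names introduced here, not columns in the dataset.

```r
asleep <- c(327, 384, 412, 340, 700, 304, 360, 325, 361, 430)  # TotalMinutesAsleep
in_bed <- c(346, 407, 442, 367, 712, 320, 377, 364, 384, 449)  # TotalTimeInBed

awake_in_bed     <- in_bed - asleep   # minutes in bed but not asleep
sleep_efficiency <- asleep / in_bed   # fraction of time in bed spent asleep

round(sleep_efficiency, 2)
mean(awake_in_bed)
```

A measure like this could separate users who need more time in bed from users who spend long periods in bed awake.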

intensities_new <- intensities %>%
  group_by(time) %>%
  drop_na() %>%
  summarise(mean_TotalIntensities_new = mean(TotalIntensity))

ggplot(data = intensities_new, aes(x = time, y = mean_TotalIntensities_new)) +
  geom_col(fill = "skyblue", color = "black") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Average Total Intensity vs Time")

People are most active between 5 am and 10 pm, which is consistent with users sleeping about 7 hours a night. Bellabeat can use this information to remind users to go to bed earlier and improve their sleep.
Activity peaks between 5 pm and 7 pm, possibly because the weather is cooler and people tend to go to the gym or for a walk at that time. Bellabeat can use this window to prompt users to go out for a run.
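
The peak hour can also be read off programmatically rather than from the chart. The sketch below uses base R's `aggregate()` in place of the `group_by()` and `summarise()` pipeline, on a small made-up table (`hourly_demo` and its values are assumptions for illustration only).

```r
# Toy hourly data (assumption): two readings for each of three hours.
hourly_demo <- data.frame(
  time = c("06:00:00", "06:00:00", "18:00:00", "18:00:00", "23:00:00", "23:00:00"),
  TotalIntensity = c(5, 9, 20, 24, 2, 4)
)

# Mean intensity for each hour of the day.
by_hour <- aggregate(TotalIntensity ~ time, data = hourly_demo, FUN = mean)

# Row for the hour with the highest mean intensity.
peak <- by_hour[which.max(by_hour$TotalIntensity), ]
peak
```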

ggplot(data = merged_data, aes(x = TotalMinutesAsleep, y = SedentaryMinutes)) +
  geom_point(fill = "skyblue", color = "darkblue")+
  geom_smooth() + labs(title = "Total Minutes Asleep vs Sedentary Minutes")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is a negative relationship between sedentary minutes and total minutes asleep, which suggests that users who sleep less spend more time on sedentary activities such as sitting at a desk, watching TV, or using a computer.
Bellabeat can use this information to improve its users' sleep by recommending that they reduce their sedentary time.
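
The direction of this relationship can be checked with a simple linear fit. The sketch below fits `lm()` to the ten merged rows previewed earlier, which come from one participant, so the slope is illustrative only and says nothing about cause.

```r
asleep    <- c(327, 384, 412, 340, 700, 304, 360, 325, 361, 430)  # TotalMinutesAsleep
sedentary <- c(728, 776, 726, 773, 539, 775, 818, 838, 732, 709)  # SedentaryMinutes

# Least-squares line: sedentary minutes as a function of minutes asleep.
fit <- lm(sedentary ~ asleep)

# A negative slope means more sleep goes with fewer sedentary minutes.
slope <- coef(fit)[["asleep"]]
slope
```

On the merged table the equivalent model would be `lm(SedentaryMinutes ~ TotalMinutesAsleep, data = merged_data)`.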

Insights and Recommendations

Participants’ average total steps per day (7638) may be insufficient for substantial health benefits according to CDC research. Bellabeat should encourage increased physical activity among its users.
The significant sedentary time (average 991 minutes) indicates a need for reducing sedentary behavior, which could contribute to overall health improvement. Bellabeat can develop features to remind users to stand, move, or take short breaks.
Most participants fall into the ‘light active’ category. Bellabeat should motivate users to increase their activity levels and set achievable activity goals.
The relationship between total steps and calories burnt highlights the importance of regular physical activity. Bellabeat can emphasize this relationship in its marketing strategy to encourage more activity.
There’s a linear relationship between total minutes asleep and total time in bed, suggesting a need to improve sleep duration and quality. Bellabeat could create features or reminders to enhance users’ sleep routines.
Users are most active between 5 am and 10 pm. Bellabeat can use this information to encourage early morning or late evening workouts.
The negative relationship between sedentary minutes and total minutes asleep indicates an opportunity for Bellabeat to educate users about the adverse effects of prolonged sitting on sleep quality.

By leveraging these insights, Bellabeat can enhance its product features, marketing strategies, and user engagement to improve overall customer satisfaction and drive growth in the smart device market.

CertifiedAuthur

2023-10-10
