Introduction

This is a capstone project for my Google Data Analytics Professional Certificate course. For the analysis I will use the R programming language and the RStudio IDE for their easy statistical analysis tools and data visualizations.

This project follows these data analysis steps:

Ask - Prepare - Process - Analyze - Share - Act

Each data analysis step follows this Case Study Roadmap:

  • Code, when needed for the step.
  • Key tasks, as a checklist.
  • Deliverable, as a checklist.

Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

About the company

Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

Ask

Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions will guide your analysis:

Key tasks

The main objective is to build a marketing strategy by analyzing smart device usage data to derive meaningful insights into how consumers use non-Bellabeat smart devices.

Stakeholders: Urška Sršen, Sando Mur, and the Bellabeat marketing analytics team.

Deliverable

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(geosphere)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, were retired in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(knitr)

Prepare

Now, let’s prepare the data for exploration.

Key tasks

  • Check the data for errors.

Importing Dataset

For this project, I will use the FitBit Fitness Tracker Data.

activity <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
intensities <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
calories <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
sleep <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weight <- read.csv("C:/Users/ARTHUR/Case Study 2/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
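
The five import calls repeat the same folder path. A minimal sketch, assuming a hypothetical helper named read_fitbit() and a hypothetical data_dir variable (neither is part of the original analysis), shows one way to build each path with file.path(). So that it runs anywhere, the sketch writes a tiny stand-in CSV to a temporary folder instead of reading the real Fitabase files.

```r
# Hypothetical data folder; a temporary folder stands in for the real one.
data_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(data_dir, showWarnings = FALSE)

# Stand-in file holding two rows in the dailyActivity layout.
demo <- data.frame(
  Id = c(1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735)
)
write.csv(demo, file.path(data_dir, "dailyActivity_merged.csv"), row.names = FALSE)

# Hypothetical helper: read one CSV from data_dir by its file name.
read_fitbit <- function(file_name, dir = data_dir) {
  read.csv(file.path(dir, file_name))
}

activity_demo <- read_fitbit("dailyActivity_merged.csv")
nrow(activity_demo)
```

With the real data, data_dir would point at the Fitabase folder, and each dataset would need only its file name.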

To verify that the datasets were imported correctly, I used the kable() and head() functions.

kable(head(activity, 10))
Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
1503960366 4/12/2016 13162 8.50 8.50 0 1.88 0.55 6.06 0 25 13 328 728 1985
1503960366 4/13/2016 10735 6.97 6.97 0 1.57 0.69 4.71 0 21 19 217 776 1797
1503960366 4/14/2016 10460 6.74 6.74 0 2.44 0.40 3.91 0 30 11 181 1218 1776
1503960366 4/15/2016 9762 6.28 6.28 0 2.14 1.26 2.83 0 29 34 209 726 1745
1503960366 4/16/2016 12669 8.16 8.16 0 2.71 0.41 5.04 0 36 10 221 773 1863
1503960366 4/17/2016 9705 6.48 6.48 0 3.19 0.78 2.51 0 38 20 164 539 1728
1503960366 4/18/2016 13019 8.59 8.59 0 3.25 0.64 4.71 0 42 16 233 1149 1921
1503960366 4/19/2016 15506 9.88 9.88 0 3.53 1.32 5.03 0 50 31 264 775 2035
1503960366 4/20/2016 10544 6.68 6.68 0 1.96 0.48 4.24 0 28 12 205 818 1786
1503960366 4/21/2016 9819 6.34 6.34 0 1.34 0.35 4.65 0 19 8 211 838 1775
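
The "Check the data for errors" task can go beyond inspecting the first rows. A minimal sketch, assuming a small stand-in data frame (the third row is a deliberate duplicate and the missing value in the fourth row is made up for illustration), counts missing values per column and duplicated rows with base R.

```r
# Stand-in data: row 3 repeats row 2, and row 4 has a made-up missing value.
checks_demo <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/13/2016", "4/14/2016"),
  TotalSteps = c(13162, 10735, 10735, NA)
)

# Missing values per column.
na_counts <- colSums(is.na(checks_demo))
na_counts

# Number of rows that exactly repeat an earlier row.
n_duplicates <- sum(duplicated(checks_demo))
n_duplicates

# Column types, to see which columns were read as text and which as numbers.
str(checks_demo)
```

The same three calls could be run on each imported dataset before any cleaning.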

Process

Cleaning data for analysis or manipulation of data

Key tasks

  • Choose your tools.
  • Transform the data so you can work with it effectively.
  • Document the cleaning process.

Deliverable

  • Documentation of any cleaning or manipulation of data.

The dates were formatted inconsistently across the datasets, so I need to fix them.

# intensities: parse the 12-hour timestamp, then split it into time and date columns
intensities$ActivityHour <- as.POSIXct(intensities$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
intensities$time <- format(intensities$ActivityHour, format = "%H:%M:%S")
intensities$date <- format(intensities$ActivityHour, format = "%m/%d/%y")
# calories: same timestamp format as intensities
calories$ActivityHour <- as.POSIXct(calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
calories$time <- format(calories$ActivityHour, format = "%H:%M:%S")
calories$date <- format(calories$ActivityHour, format = "%m/%d/%y")
# activity: date only, no time of day
activity$ActivityDate <- as.POSIXct(activity$ActivityDate, format = "%m/%d/%Y", tz = Sys.timezone())
activity$date <- format(activity$ActivityDate, format = "%m/%d/%y")
# sleep: timestamp format, but only the date is kept
sleep$SleepDay <- as.POSIXct(sleep$SleepDay, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
sleep$date <- format(sleep$SleepDay, format = "%m/%d/%y")

Now that the date formats are consistent across the datasets, it’s time for exploration.
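
To make the format strings concrete, here is a small worked example on a single timestamp. It is a sketch under one assumption: tz = "UTC" is used so the result is the same on any machine, whereas the analysis above uses Sys.timezone().

```r
# One timestamp in the month/day/year 12-hour layout matched by the format string.
raw <- "4/12/2016 1:00:00 PM"

# %I is the 12-hour clock hour and %p is the AM/PM marker.
# tz = "UTC" is an assumption for reproducibility, not the setting used above.
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

time_part <- format(parsed, format = "%H:%M:%S")  # 24-hour clock time
date_part <- format(parsed, format = "%m/%d/%y")  # two-digit year

time_part
date_part
```

A string that does not match the format parses to NA, so counting NA values after the conversion is a quick sanity check.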

Analyze

Now that the data has been prepared and formatted, it’s time for analysis.

Key tasks

  • Perform calculations.
  • Identify trends and relationships

Exploring and summarizing data

n_distinct(activity$Id)
## [1] 33
n_distinct(intensities$Id)
## [1] 33
n_distinct(calories$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8

Here we learn the number of participants in each dataset. There are 33 participants each in activity, intensities, and calories, while sleep and weight have 24 and 8 participants respectively. The weight dataset has too few participants, compared with the others, to support recommendations or conclusions.
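
The participant counts can also be collected in one step. A minimal sketch, assuming small stand-in data frames with made-up Ids, applies the same distinct-Id count to a named list. Here length(unique(x)) serves as a base R equivalent of n_distinct(x).

```r
# Stand-in datasets with made-up Ids (not the real participants).
demo_sets <- list(
  activity = data.frame(Id = c(1, 1, 2, 3)),
  sleep    = data.frame(Id = c(1, 2, 2)),
  weight   = data.frame(Id = c(3))
)

# Count distinct participants in each dataset.
participants <- sapply(demo_sets, function(d) length(unique(d$Id)))
participants
```

With the real data, the list would hold activity, intensities, calories, sleep, and weight, which makes small samples such as weight easy to spot side by side.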

Let’s have an understanding of the data through summary.

# activity
activity %>%
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes,
         Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900
# explore num of active minutes per category
activity %>%
  select(VeryActiveMinutes,
    FairlyActiveMinutes,
    LightlyActiveMinutes) %>%
  summary()
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0       
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0       
##  Median :  4.00    Median :  6.00      Median :199.0       
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8       
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0       
##  Max.   :210.00    Max.   :143.00      Max.   :518.0
# explore num of active distance per category
activity %>%
  select(VeryActiveDistance,
         ModeratelyActiveDistance,
         LightActiveDistance) %>%
  summary()
##  VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
##  Min.   : 0.000     Min.   :0.0000           Min.   : 0.000     
##  1st Qu.: 0.000     1st Qu.:0.0000           1st Qu.: 1.945     
##  Median : 0.210     Median :0.2400           Median : 3.365     
##  Mean   : 1.503     Mean   :0.5675           Mean   : 3.341     
##  3rd Qu.: 2.053     3rd Qu.:0.8000           3rd Qu.: 4.782     
##  Max.   :21.920     Max.   :6.4800           Max.   :10.710
# calories
calories %>%
  select(Calories) %>%
  summary()
##     Calories     
##  Min.   : 42.00  
##  1st Qu.: 63.00  
##  Median : 83.00  
##  Mean   : 97.39  
##  3rd Qu.:108.00  
##  Max.   :948.00
# sleep
sleep %>%
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0
# weight
weight %>%
  select(WeightKg,
         BMI) %>%
  summary()
##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

Insights from the summary of the data

  • The average participant takes 7,638 steps per day, which is a bit below the level associated with health benefits according to CDC research.
  • The average sedentary time is 991 minutes (about 16.5 hours) per day, which is high and consistent with the modest daily step count. It should be reduced.
  • Most participants are lightly active.
  • Participants sleep about 7 hours on average (419.5 minutes), usually in a single sleep record per day.
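
The unit conversions behind these insights are simple arithmetic on the summary means. As a sketch, the snippet below converts the reported minutes to hours and also computes the share of time in bed spent asleep. That share is a ratio of the two means, which is an extra illustration and not necessarily equal to the average of the per-night ratios.

```r
# Means taken from the summary() output above.
mean_sedentary_min <- 991.2
mean_asleep_min    <- 419.5
mean_in_bed_min    <- 458.6

sedentary_hours <- mean_sedentary_min / 60
asleep_hours    <- mean_asleep_min / 60

# Ratio of the two sleep means (illustrative only).
asleep_share <- mean_asleep_min / mean_in_bed_min

round(sedentary_hours, 1)
round(asleep_hours, 1)
round(asleep_share, 2)
```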

Merging of Data

To better understand how physical activity and sleep patterns might relate to or influence each other for each participant, I will merge the activity and sleep data on the columns Id and date.

merged_data <- merge(activity, sleep, by = c('Id', 'date'))
kable(head(merged_data, 10))
Id date ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
1503960366 04/12/16 2016-04-12 13162 8.50 8.50 0 1.88 0.55 6.06 0 25 13 328 728 1985 2016-04-12 1 327 346
1503960366 04/13/16 2016-04-13 10735 6.97 6.97 0 1.57 0.69 4.71 0 21 19 217 776 1797 2016-04-13 2 384 407
1503960366 04/15/16 2016-04-15 9762 6.28 6.28 0 2.14 1.26 2.83 0 29 34 209 726 1745 2016-04-15 1 412 442
1503960366 04/16/16 2016-04-16 12669 8.16 8.16 0 2.71 0.41 5.04 0 36 10 221 773 1863 2016-04-16 2 340 367
1503960366 04/17/16 2016-04-17 9705 6.48 6.48 0 3.19 0.78 2.51 0 38 20 164 539 1728 2016-04-17 1 700 712
1503960366 04/19/16 2016-04-19 15506 9.88 9.88 0 3.53 1.32 5.03 0 50 31 264 775 2035 2016-04-19 1 304 320
1503960366 04/20/16 2016-04-20 10544 6.68 6.68 0 1.96 0.48 4.24 0 28 12 205 818 1786 2016-04-20 1 360 377
1503960366 04/21/16 2016-04-21 9819 6.34 6.34 0 1.34 0.35 4.65 0 19 8 211 838 1775 2016-04-21 1 325 364
1503960366 04/23/16 2016-04-23 14371 9.04 9.04 0 2.81 0.87 5.36 0 41 21 262 732 1949 2016-04-23 1 361 384
1503960366 04/24/16 2016-04-24 10039 6.41 6.41 0 2.92 0.21 3.28 0 39 5 238 709 1788 2016-04-24 1 430 449
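
By default, merge() keeps only the rows whose Id and date appear in both datasets, which is why 04/14/16 is missing from the merged preview even though the activity data has that day. A minimal sketch, using three activity rows and two sleep rows taken from the tables above and trimmed to a few columns, compares the default inner join with a left join via all.x = TRUE.

```r
# Three activity days and two sleep days for the same participant.
act_demo <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  date = c("04/12/16", "04/13/16", "04/14/16"),
  TotalSteps = c(13162, 10735, 10460)
)
sleep_demo <- data.frame(
  Id = c(1503960366, 1503960366),
  date = c("04/12/16", "04/13/16"),
  TotalMinutesAsleep = c(327, 384)
)

# Inner join (default): only days present in both data frames.
inner <- merge(act_demo, sleep_demo, by = c("Id", "date"))

# Left join: every activity day, with NA where no sleep record exists.
left <- merge(act_demo, sleep_demo, by = c("Id", "date"), all.x = TRUE)

nrow(inner)
nrow(left)
```

The inner join suits an analysis that needs both measurements for each day, while the left join shows how many activity days lack a sleep record.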

Visualization

ggplot(data = activity, aes(x = TotalSteps, y= Calories)) + geom_point() +
  geom_smooth() + labs(title = "Total Steps vs Calories")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = sleep) + geom_point(mapping = aes(x = TotalMinutesAsleep, y= TotalTimeInBed)) +
  labs(title = "Total Minutes Asleep vs Total Time In Bed")

intensities_new <- intensities %>%
  group_by(time) %>%
  drop_na() %>%
  summarise(mean_TotalIntensities_new = mean(TotalIntensity))

# geom_col() draws one bar per precomputed hourly mean
ggplot(data = intensities_new, aes(x = time, y = mean_TotalIntensities_new)) +
  geom_col(fill = "skyblue", color = "black") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Average Total Intensity vs Time")

ggplot(data = merged_data, aes(x = TotalMinutesAsleep, y = SedentaryMinutes)) +
  geom_point(fill = "skyblue", color = "darkblue")+
  geom_smooth() + labs(title = "Total Minutes Asleep vs Sedentary Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
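
A scatter plot shows a relationship visually, and cor() can put a number on it. The sketch below uses only the ten activity rows previewed earlier, all from a single participant, so its result illustrates the call and says nothing about the full dataset. On the real data the call would be cor(activity$TotalSteps, activity$Calories).

```r
# TotalSteps and Calories from the ten previewed rows of dailyActivity.
total_steps    <- c(13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 10544, 9819)
daily_calories <- c(1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775)

# Pearson correlation coefficient (the default method of cor()).
r <- cor(total_steps, daily_calories)
round(r, 2)
```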

Insights and Recommendations

By leveraging these insights, Bellabeat can enhance its product features, marketing strategies, and user engagement to improve overall customer satisfaction and drive growth in the smart device market.