Bellabeat Wellness Case Study

Introduction and background

This case study demonstrates my data analytics skills in R programming. Acting as a junior data analyst at Bellabeat, a high-tech company that manufactures health-focused smart products, I analyze and present smart device data to gain insight into how consumers use their smart products.

Upload your CSV files to R

Upload CSV files to the project from the relevant data source: https://www.kaggle.com/arashnic/fitbit

In this project the following two CSV files will be used:

* Daily Activity
* Daily Calories

Install and load common packages and libraries

You can always install and load packages along the way as you may discover you need different packages after you start your analysis. There are several packages that could be installed and loaded at this stage.
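
As a minimal sketch of this step, the helper below reports which packages are not yet installed, so that only those need to be passed to install.packages(). The function name missing_packages and the package list are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used later in this case study
needed <- c("readr", "dplyr", "ggplot2")
to_install <- missing_packages(needed)

# Install only what is missing (commented out so the sketch has no side effects)
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first avoids reinstalling a package every time the report is knitted.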

Load CSV files

Here the “readr” package is installed and loaded so that one of the CSV files from the dataset can be read into a dataframe named ‘daily_activity’.

install.packages("readr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

library(readr)
daily_activity <- read_csv("daily_activity.csv")

## Rows: 940 Columns: 15

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
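
The column specification above shows that ActivityDate is read as character (chr). If the dates need to be sorted or compared chronologically, they could be converted to Date values. The following is a minimal sketch on sample strings in the same month/day/year form; the vector names are illustrative.

```r
# Sample date strings in the month/day/year form used by ActivityDate
activity_date_chr <- c("4/12/2016", "4/13/2016", "4/14/2016")

# Convert to Date so the values sort and compare chronologically
activity_date <- as.Date(activity_date_chr, format = "%m/%d/%Y")

# ISO-formatted result for a quick check
format(activity_date, "%Y-%m-%d")
```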

Get the summary of the dataframe

This stage provides a quick overview of the dataset.

head(daily_activity)

## # A tibble: 6 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
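
Besides head(), a few base R functions give a complementary overview. The sketch below runs them on a small stand-in data frame (three rows copied from the data shown in this report); the name toy_activity is an assumption for illustration.

```r
# Toy stand-in for daily_activity (three rows copied from this report)
toy_activity <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  TotalSteps = c(13162, 10735, 10460),
  Calories = c(1985, 1797, 1776)
)

dim(toy_activity)                 # number of rows and columns
str(toy_activity)                 # column types and first values
summary(toy_activity$TotalSteps)  # min, quartiles, mean, max
colSums(is.na(toy_activity))      # missing values per column
length(unique(toy_activity$Id))   # number of distinct users
```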

Organize the data to make it useful

install.packages("dplyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

library(dplyr)
daily_activity %>%
  select(Id, ActivityDate, TotalSteps, Calories)

## # A tibble: 940 × 4
##            Id ActivityDate TotalSteps Calories
##         <dbl> <chr>             <dbl>    <dbl>
##  1 1503960366 4/12/2016         13162     1985
##  2 1503960366 4/13/2016         10735     1797
##  3 1503960366 4/14/2016         10460     1776
##  4 1503960366 4/15/2016          9762     1745
##  5 1503960366 4/16/2016         12669     1863
##  6 1503960366 4/17/2016          9705     1728
##  7 1503960366 4/18/2016         13019     1921
##  8 1503960366 4/19/2016         15506     2035
##  9 1503960366 4/20/2016         10544     1786
## 10 1503960366 4/21/2016          9819     1775
## # … with 930 more rows

Load CSV files

Here the second CSV file from the dataset is read into a dataframe named ‘daily_calories’.

library(readr)
daily_calories <- read_csv("daily_calories.csv")

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Get the summary of the dataframe

The following code chunk shows the first rows of the dataset.

head(daily_calories)

## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728

Clean and organize loaded data for further analysis

This stage of the data analysis is important because it reduces the chance of errors during the analysis stage.

sort the data by Id

library(tidyverse)
daily_activity %>% arrange(Id)

## # A tibble: 940 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
##  1 1.50e9 4/12/2016         13162          8.5             8.5                 0
##  2 1.50e9 4/13/2016         10735          6.97            6.97                0
##  3 1.50e9 4/14/2016         10460          6.74            6.74                0
##  4 1.50e9 4/15/2016          9762          6.28            6.28                0
##  5 1.50e9 4/16/2016         12669          8.16            8.16                0
##  6 1.50e9 4/17/2016          9705          6.48            6.48                0
##  7 1.50e9 4/18/2016         13019          8.59            8.59                0
##  8 1.50e9 4/19/2016         15506          9.88            9.88                0
##  9 1.50e9 4/20/2016         10544          6.68            6.68                0
## 10 1.50e9 4/21/2016          9819          6.34            6.34                0
## # … with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

group the data by Id and Date and average the total steps

daily_activity %>%
  group_by(Id, ActivityDate) %>%
  drop_na() %>%
  summarize(average_TotalSteps=mean(TotalSteps))

## `summarise()` has grouped output by 'Id'. You can override using the `.groups`
## argument.

## # A tibble: 940 × 3
## # Groups:   Id [33]
##            Id ActivityDate average_TotalSteps
##         <dbl> <chr>                     <dbl>
##  1 1503960366 4/12/2016                 13162
##  2 1503960366 4/13/2016                 10735
##  3 1503960366 4/14/2016                 10460
##  4 1503960366 4/15/2016                  9762
##  5 1503960366 4/16/2016                 12669
##  6 1503960366 4/17/2016                  9705
##  7 1503960366 4/18/2016                 13019
##  8 1503960366 4/19/2016                 15506
##  9 1503960366 4/20/2016                 10544
## 10 1503960366 4/21/2016                  9819
## # … with 930 more rows
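
The output above has 940 groups for 940 rows, so each Id and ActivityDate pair holds a single day and the mean equals that day's value. If one average per user is wanted, grouping by Id alone is one option. The sketch below uses made-up values in a toy data frame; toy_activity and steps_per_user are illustrative names.

```r
library(dplyr)

# Toy stand-in for daily_activity: two users, two days each (made-up values)
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(1000, 3000, 4000, 6000)
)

# One row per user with that user's mean daily steps
steps_per_user <- toy_activity %>%
  group_by(Id) %>%
  summarize(average_TotalSteps = mean(TotalSteps), .groups = "drop")
```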

select unique Id and create users_steps table

daily_users <- daily_activity %>% distinct(Id, .keep_all=TRUE)
users_steps <- daily_users %>%
  select(Id, TotalSteps)
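
Note that distinct() with .keep_all = TRUE keeps the first row it finds for each Id, so the other columns hold that single day's values. The sketch below, on made-up values, contrasts this with summing over all days per user; the toy names are assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for daily_activity: two users, two days each (made-up values)
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2),
  TotalSteps = c(1000, 3000, 4000, 6000)
)

# Keeps the first row per Id: TotalSteps is that first day's value
first_rows <- toy_activity %>% distinct(Id, .keep_all = TRUE)

# Sums every day per Id instead
user_totals <- toy_activity %>%
  group_by(Id) %>%
  summarize(TotalSteps = sum(TotalSteps), .groups = "drop")
```

Which of the two is appropriate depends on whether a single-day snapshot or a per-user total is intended.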

create users_activity table selecting active minutes

users_activity <- daily_users %>% 
  select(Id, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>% drop_na()

Analyze and calculate clean data

select the maximum number across multiple columns in the users_activity table

users_activity$Max<-pmax(users_activity$VeryActiveMinutes, users_activity$FairlyActiveMinutes, users_activity$LightlyActiveMinutes, users_activity$SedentaryMinutes)

identify the status of activity for each Id based on the active minutes

users_status <- users_activity %>%
  select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes)
users_status$Largest_Column<-colnames(users_status)[apply(users_status,1,which.max)]
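
To illustrate how the line above works: which.max() returns the position of the largest value in each row, and indexing colnames() with that position gives the matching column name. The sketch below uses made-up minutes; toy_status is an illustrative name.

```r
# Toy stand-in for users_status (made-up minutes for three users)
toy_status <- data.frame(
  VeryActiveMinutes = c(25, 120, 0),
  FairlyActiveMinutes = c(13, 30, 5),
  LightlyActiveMinutes = c(328, 90, 200)
)

# Position of the row-wise maximum, mapped back to a column name
largest <- colnames(toy_status)[apply(toy_status, 1, which.max)]

# max.col() is a base R alternative; ties.method = "first" matches which.max()
largest_alt <- colnames(toy_status)[max.col(as.matrix(toy_status), ties.method = "first")]
```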

count number of each active status

status <- table(users_status['Largest_Column'])

create total_activity table selecting active minutes and sum up values in each row

total_activity <- daily_activity %>%
  select(Id, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>% drop_na()
total_activity$sum <- total_activity$VeryActiveMinutes+total_activity$FairlyActiveMinutes+total_activity$LightlyActiveMinutes+total_activity$SedentaryMinutes
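
The same row total could also be computed with rowSums(), which avoids repeating the data frame name for every column. This is a minimal sketch on made-up values; toy_minutes and minute_cols are illustrative names.

```r
# Toy stand-in for total_activity (made-up minutes for two rows)
toy_minutes <- data.frame(
  Id = c(1, 2),
  VeryActiveMinutes = c(25, 0),
  FairlyActiveMinutes = c(13, 5),
  LightlyActiveMinutes = c(328, 200),
  SedentaryMinutes = c(728, 1235)
)

# Columns to add up in each row
minute_cols <- c("VeryActiveMinutes", "FairlyActiveMinutes",
                 "LightlyActiveMinutes", "SedentaryMinutes")
toy_minutes$sum <- rowSums(toy_minutes[, minute_cols])
```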

select unique Ids in total_activity table

total_activity2 <- total_activity %>% distinct(Id, .keep_all=TRUE)

create a join table containing clean data about total steps and calories

library(dplyr)

total_steps <- daily_activity %>%
  select(Id, TotalSteps)

total_calories <- daily_calories %>%
  select(Id, Calories)
# treat zero values as missing, then keep only complete rows
total_calories[total_calories==0] <- NA
total_calories2 <- total_calories[complete.cases(total_calories),]
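
The zero-to-NA step can be sketched on a toy table as follows. Unlike the code above, which replaces zeros anywhere in the table, this variant only touches the Calories column; the values and the toy names are made up for illustration.

```r
# Toy stand-in for total_calories (made-up rows, one zero-calorie day)
toy_calories <- data.frame(Id = c(1, 1, 2), Calories = c(1985, 0, 1797))

# Treat a zero-calorie day as missing
toy_calories$Calories[toy_calories$Calories == 0] <- NA

# Keep only rows without missing values
toy_calories2 <- toy_calories[complete.cases(toy_calories), ]
```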

install.packages("dplyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

library(dplyr)

steps_vs_calories <- inner_join(x = total_steps, y = total_calories2, by = "Id")

group steps_vs_calories table by Id and sum total steps and calories

steps_vs_calories2 <- aggregate(. ~ Id, # Keep all variables
          steps_vs_calories,
          sum)
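
One point worth checking in this step: both tables hold one row per user per day, so a join on Id alone pairs every steps row with every calories row of the same user, and the summed totals grow accordingly. The sketch below, on made-up values, compares that with a join on Id and date. It assumes the date columns are kept in both tables (ActivityDate and ActivityDay); the toy names are illustrative.

```r
library(dplyr)

# Toy stand-ins: two users, two days each (made-up values)
toy_steps <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(1000, 3000, 4000, 6000)
)
toy_calories <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  Calories = c(1800, 2000, 2100, 2300)
)

# Join on Id only: each user's rows are paired in every combination
# (newer dplyr versions warn about this many-to-many match)
by_id <- suppressWarnings(inner_join(toy_steps, toy_calories, by = "Id"))

# Join on Id and date: one row per user and day
by_id_date <- inner_join(toy_steps, toy_calories,
                         by = c("Id" = "Id", "ActivityDate" = "ActivityDay"))

# Per-user totals from the one-row-per-day join
totals <- aggregate(cbind(TotalSteps, Calories) ~ Id, by_id_date, sum)
```

In this toy data the Id-only join has 8 rows and the Id-and-date join has 4, so sums taken from the first are twice as large.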

Visualize the data being analyzed and calculated

Visualization stage requires some packages to be installed and loaded.

install.packages("ggplot2")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

library(ggplot2)
install.packages("colorspace")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

library(colorspace)
install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

library(tidyverse)

line chart of users_steps

plot(users_steps$TotalSteps,type = "o", col = "black", xlab = "Users", ylab = "Steps",
     main = "Total Steps")

The chart above illustrates total steps made by 33 unique customers.

line chart of users_steps with average line

ggplot(data=users_steps) +
  geom_line(mapping = aes(x=Id,y=TotalSteps)) +
  geom_hline(yintercept = mean(users_steps$TotalSteps), color="blue")+
  labs(title="Total Steps")+
  annotate("text",x=5.0e+09,y=20000,label="The blue line indicates the average number of steps")

The chart above shows the average number of total steps across customers, which is close to 10,000 steps.

bar chart of total_activity table

barplot(total_activity2$sum, col = "grey", xlab = "Users", ylab = "Minutes",
     main = "Activity Minutes")

The chart above illustrates activity minutes for 33 unique customers. Total activity for most customers is more than 800 minutes.

pie chart with percentages

slices <- c(29, 4)
lbls <- c("Lightly Active", "Very Active")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(slices,labels = lbls, col=grey.colors(length(lbls)),
    main="Pie Chart of Activity")

The pie chart above shows the most frequent activity status per customer. According to this chart, 88% of customers are light users and 12% of customers are very active users.
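
The slice sizes and percentages could also be derived from the status counts instead of being typed in by hand. The sketch below rebuilds the counts used above (29 and 4) as a toy vector; the mapping of these counts to the two column names is an assumption for illustration.

```r
# Toy stand-in for users_status$Largest_Column with the counts 29 and 4
largest_column <- c(rep("LightlyActiveMinutes", 29), rep("VeryActiveMinutes", 4))

# Count each status and convert the counts to rounded percentages
status <- table(largest_column)
pct <- round(100 * prop.table(status))

# Labels in the form "<status> <percent>%"
lbls <- paste0(names(status), " ", pct, "%")

# Uncomment to draw the chart from the derived values
# pie(as.vector(status), labels = lbls, col = grey.colors(length(lbls)),
#     main = "Pie Chart of Activity")
```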

scatter plot of total steps vs. calories

ggplot(data=steps_vs_calories2)+
  geom_point(mapping=aes(x=TotalSteps,y=Calories))+
  labs(title="Steps vs. Calories")

The chart above shows a positive relationship between steps and calories: the more steps are made, the more calories are burned.
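
The direction of the relationship can also be checked numerically with a correlation coefficient. The sketch below uses only the six daily rows shown in the daily_activity preview earlier in this report, so it illustrates the calculation and is not a result for the full dataset; the vector names are illustrative.

```r
# Six daily rows from the daily_activity preview in this report
steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories_burned <- c(1985, 1797, 1776, 1745, 1863, 1728)

# Pearson correlation: values above 0 indicate a positive linear relationship
r <- cor(steps, calories_burned)
r
```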

Business insights from analysis

Bellabeat is a successful company focused on producing smart products for women: Bellabeat app, Leaf, Time, Spring and Bellabeat membership.

The analysis of smart device fitness data from FitBit customers shows how users of smart trackers use their devices. The following trends were identified:

* the average number of total steps made by every customer is close to 10,000 steps;
* activity for most customers is above 800 minutes;
* 88% of customers are light users and 12% of customers are very active users;
* a positive relationship between steps and calories was identified.

Based on the identified trends the following changes to the marketing strategy of Bellabeat app and Time could be made in order to increase users’ activity and help unlock new growth opportunities for the company:

* Most users were found to make more than 7,500 steps. I recommend rewarding users for reaching 10,000 steps.
* Bellabeat products are mostly used by very active and light users with more than 800 minutes of activity. I recommend focusing on sedentary users by sending short articles on the importance of daily activity in the Bellabeat app. In addition, I recommend sending push notifications on the Time product when activity seems to be low.
* Activity intensity was found to be positively correlated with calories burned. I recommend setting a daily indicator for calories burned.

Zhanara Zeinesheva

2/14/2022
