Introduction and background

This is an example case study to demonstrate my knowledge and skills in data analytics using R programming. Working as a junior data analyst for the Bellabeat Company, a high-tech company that manufactures health-focused smart products, I show my skills in analyzing and presenting smart device data to gain insight into how consumers are using their smart products.

Upload your CSV files to R

Upload CSV files to the project from the relevant data source: https://www.kaggle.com/arashnic/fitbit

In this project the following two CSV files will be used:

  * Daily Activity
  * Daily Calories

Install and load common packages and libraries

You can always install and load packages along the way as you may discover you need different packages after you start your analysis. There are several packages that could be installed and loaded at this stage.
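
As a minimal sketch of installing packages only when they are needed, the helper below reports which packages are not yet installed. The name `missing_packages` and the package list are assumptions made for illustration; the install and attach steps are left commented out so the snippet has no side effects.

```r
# Return the subset of `pkgs` that is not installed in the current library paths.
# `missing_packages` is a hypothetical helper name introduced for this sketch.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used later in this case study
needed <- c("readr", "dplyr", "ggplot2")
to_install <- missing_packages(needed)

# Uncomment to install whatever is missing, then attach everything:
# if (length(to_install) > 0) install.packages(to_install)
# invisible(lapply(needed, library, character.only = TRUE))
```

Checking first avoids reinstalling a package every time the document is knitted.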

Load CSV files

Here the “readr” package is installed and loaded so that CSV files can be read. One of the CSV files from the dataset is then read into a dataframe named ‘daily_activity’.

install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(readr)
daily_activity <- read_csv("daily_activity.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
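
The column specification shows that ActivityDate is read as character. If date arithmetic or weekday grouping is needed later, the column could be converted. The sketch below uses base R on a small illustrative data frame (not the real file) and assumes the month/day/year layout seen in the dataset.

```r
# Small illustrative data frame; the real data comes from daily_activity.csv
sample_activity <- data.frame(
  Id = c(1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735),
  stringsAsFactors = FALSE
)

# Convert the character column to Date, assuming month/day/year order
sample_activity$ActivityDate <- as.Date(sample_activity$ActivityDate,
                                        format = "%m/%d/%Y")

class(sample_activity$ActivityDate)
```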

Get the summary of the dataframe

This stage provides a quick overview of the dataset.

head(daily_activity)
## # A tibble: 6 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
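
Beyond head(), a few quick counts can help judge data quality. The sketch below runs on a small made-up data frame (the values and the duplicate row are assumptions for illustration) and counts rows, unique users, duplicated user-day pairs, and missing values per column.

```r
# Made-up sample with one duplicated user-day and two missing step counts
sample_activity <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735, 9762, NA, NA),
  stringsAsFactors = FALSE
)

n_rows <- nrow(sample_activity)                    # number of records
n_users <- length(unique(sample_activity$Id))      # number of distinct users
# Rows that repeat an earlier (Id, ActivityDate) combination
n_duplicate_days <- sum(duplicated(sample_activity[, c("Id", "ActivityDate")]))
# Missing values in each column
missing_per_column <- colSums(is.na(sample_activity))
```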

Organize the data to make it useful

install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(dplyr)
daily_activity %>%
  select(Id, ActivityDate, TotalSteps, Calories)
## # A tibble: 940 × 4
##            Id ActivityDate TotalSteps Calories
##         <dbl> <chr>             <dbl>    <dbl>
##  1 1503960366 4/12/2016         13162     1985
##  2 1503960366 4/13/2016         10735     1797
##  3 1503960366 4/14/2016         10460     1776
##  4 1503960366 4/15/2016          9762     1745
##  5 1503960366 4/16/2016         12669     1863
##  6 1503960366 4/17/2016          9705     1728
##  7 1503960366 4/18/2016         13019     1921
##  8 1503960366 4/19/2016         15506     2035
##  9 1503960366 4/20/2016         10544     1786
## 10 1503960366 4/21/2016          9819     1775
## # … with 930 more rows

Load CSV files

Here the second CSV file from the dataset is read into a dataframe named ‘daily_calories’.

library(readr)
daily_calories <- read_csv("daily_calories.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Get the summary of the dataframe

A brief description of the dataset can be shown using the following code chunk.

head(daily_calories)
## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728
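
Both files contain a Calories column, so it may be worth confirming that the two agree before using them together. The following is a minimal sketch on a few illustrative rows; the object names are assumptions, and the check itself would need to be run on the full dataframes.

```r
# Illustrative rows mirroring the two files' key and Calories columns
activity_sample <- data.frame(
  Id = rep(1503960366, 3),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985, 1797, 1776),
  stringsAsFactors = FALSE
)
calories_sample <- data.frame(
  Id = rep(1503960366, 3),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985, 1797, 1776),
  stringsAsFactors = FALSE
)

# Match records on user and day; the shared Calories column gets a suffix per source
merged <- merge(activity_sample, calories_sample,
                by.x = c("Id", "ActivityDate"),
                by.y = c("Id", "ActivityDay"),
                suffixes = c("_activity", "_calories"))

# TRUE when every matched record reports the same calories in both files
calories_agree <- all(merged$Calories_activity == merged$Calories_calories)
```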

Clean and organize loaded data for further analysis

This stage of the data analysis is very important because it can reduce errors during the analysis stage.

library(tidyverse)
daily_activity %>% arrange(Id)
## # A tibble: 940 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
##  1 1.50e9 4/12/2016         13162          8.5             8.5                 0
##  2 1.50e9 4/13/2016         10735          6.97            6.97                0
##  3 1.50e9 4/14/2016         10460          6.74            6.74                0
##  4 1.50e9 4/15/2016          9762          6.28            6.28                0
##  5 1.50e9 4/16/2016         12669          8.16            8.16                0
##  6 1.50e9 4/17/2016          9705          6.48            6.48                0
##  7 1.50e9 4/18/2016         13019          8.59            8.59                0
##  8 1.50e9 4/19/2016         15506          9.88            9.88                0
##  9 1.50e9 4/20/2016         10544          6.68            6.68                0
## 10 1.50e9 4/21/2016          9819          6.34            6.34                0
## # … with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
daily_activity %>% group_by(Id, ActivityDate) %>% drop_na() %>% summarize(average_TotalSteps=mean(TotalSteps))
## `summarise()` has grouped output by 'Id'. You can override using the `.groups`
## argument.
## # A tibble: 940 × 3
## # Groups:   Id [33]
##            Id ActivityDate average_TotalSteps
##         <dbl> <chr>                     <dbl>
##  1 1503960366 4/12/2016                 13162
##  2 1503960366 4/13/2016                 10735
##  3 1503960366 4/14/2016                 10460
##  4 1503960366 4/15/2016                  9762
##  5 1503960366 4/16/2016                 12669
##  6 1503960366 4/17/2016                  9705
##  7 1503960366 4/18/2016                 13019
##  8 1503960366 4/19/2016                 15506
##  9 1503960366 4/20/2016                 10544
## 10 1503960366 4/21/2016                  9819
## # … with 930 more rows
daily_users <- daily_activity %>% distinct(Id, .keep_all=TRUE)
users_steps <- daily_users %>%
  select(Id, TotalSteps)
users_activity <- daily_users %>% 
  select(Id, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>% drop_na()
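
Note that distinct(Id, .keep_all=TRUE) keeps the first row found for each Id, so the per-user tables above reflect one recorded day per user. If a summary over all recorded days is wanted instead, one possible alternative is to aggregate per user. The sketch below uses base R aggregate() on made-up values to contrast the two approaches.

```r
# Made-up daily records for two users
sample_activity <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  TotalSteps = c(13162, 10735, 10460, 4000, 6000)
)

# First record per user (the same rows distinct(Id, .keep_all = TRUE) would keep)
first_day <- sample_activity[!duplicated(sample_activity$Id), ]

# Mean daily steps per user across all recorded days
per_user_mean <- aggregate(TotalSteps ~ Id, data = sample_activity, FUN = mean)
```

The first approach describes a single day for each user, while the second describes each user's typical day.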

Analyze and calculate clean data

users_activity$Max<-pmax(users_activity$VeryActiveMinutes, users_activity$FairlyActiveMinutes, users_activity$LightlyActiveMinutes, users_activity$SedentaryMinutes)
users_status <- users_activity %>%
  select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes)
users_status$Largest_Column<-colnames(users_status)[apply(users_status,1,which.max)]
status <- table(users_status['Largest_Column'])
total_activity <- daily_activity %>%
  select(Id, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>% drop_na()
total_activity$sum <- total_activity$VeryActiveMinutes+total_activity$FairlyActiveMinutes+total_activity$LightlyActiveMinutes+total_activity$SedentaryMinutes
total_activity2 <- total_activity %>% distinct(Id, .keep_all=TRUE)
# Keep only the user Id and the daily step count
total_steps <- daily_activity %>%
  select(Id, TotalSteps)

# Keep only the user Id and the daily calories
total_calories <- daily_calories %>%
  select(Id, Calories)
# Treat zero values as missing, then drop incomplete rows
total_calories[total_calories==0] <- NA
total_calories2 <- total_calories[complete.cases(total_calories),]

steps_vs_calories <- inner_join(x = total_steps, y = total_calories2, by = "Id")
steps_vs_calories2 <- aggregate(. ~ Id, # Keep all variables
          steps_vs_calories,
          sum)
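
Joining on Id alone pairs every step record of a user with every calorie record of the same user, so the sums computed afterwards are inflated by the number of matched days. A possible alternative, sketched below with base R merge() on made-up values, is to join on both the user and the day before aggregating. The object names are assumptions for illustration.

```r
# Made-up records: two users, two days each
steps <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(1000, 2000, 3000, 4000),
  stringsAsFactors = FALSE
)
calories <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  Calories = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

# Join on Id only: each user's 2 step rows match 2 calorie rows (2 x 2 per user)
by_id <- merge(steps[, c("Id", "TotalSteps")],
               calories[, c("Id", "Calories")], by = "Id")

# Join on Id and day: one row per user-day
by_day <- merge(steps, calories,
                by.x = c("Id", "ActivityDate"),
                by.y = c("Id", "ActivityDay"))

# Per-user totals computed from the one-row-per-day join
per_user <- aggregate(cbind(TotalSteps, Calories) ~ Id, data = by_day, FUN = sum)
```

In this toy example the Id-only join yields 8 rows and the keyed join yields 4, which is why the totals differ.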

Visualize the data being analyzed and calculated

The visualization stage requires some packages to be installed and loaded.

install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(ggplot2)
install.packages("colorspace")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(colorspace)
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(tidyverse)
plot(users_steps$TotalSteps,type = "o", col = "black", xlab = "Users", ylab = "Steps",
     main = "Total Steps")

The plot above shows the total steps recorded for each of the 33 unique customers.

ggplot(data=users_steps) +
  geom_line(mapping = aes(x=Id,y=TotalSteps)) +
  geom_hline(yintercept = mean(users_steps$TotalSteps), color="blue")+
  labs(title="Total Steps")+
  annotate("text",x=5.0e+09,y=20000,label="The blue line indicates the average number of steps")

The plot above shows total steps by customer, with the blue line marking the average, which is close to 10,000 steps.
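
To put a number on such a reading, the average and the share of records above a step threshold can be computed directly. The sketch below uses a handful of illustrative step counts, and the thresholds of 7,500 and 10,000 are the ones discussed in this case study.

```r
# Illustrative step counts (a few daily values, not the full dataset)
steps <- c(13162, 10735, 10460, 9762, 12669, 9705)

average_steps <- mean(steps)            # average across the sample
share_over_7500 <- mean(steps > 7500)   # proportion of values above 7,500
share_at_10000 <- mean(steps >= 10000)  # proportion reaching 10,000 or more
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is a compact way to express such shares.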

barplot(total_activity2$sum, col = "grey", xlab = "Users", ylab = "Minutes",
     main = "Activity Minutes")

The bar chart above shows the activity minutes recorded for each of the 33 unique customers. Total activity for most customers is more than 800 minutes.
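
Because the total is the sum of four minute columns for a single day, a simple sanity check is that it should not exceed 1,440 minutes. The sketch below uses made-up values; the column names follow the dataset, while `TotalMinutes` and `over_a_day` are names introduced for illustration.

```r
# Made-up daily minute records for three users
sample_minutes <- data.frame(
  Id = c(1, 2, 3),
  VeryActiveMinutes = c(25, 0, 60),
  FairlyActiveMinutes = c(13, 0, 30),
  LightlyActiveMinutes = c(328, 100, 200),
  SedentaryMinutes = c(728, 1340, 1200)
)

minute_cols <- c("VeryActiveMinutes", "FairlyActiveMinutes",
                 "LightlyActiveMinutes", "SedentaryMinutes")

# Sum the four minute columns row by row
sample_minutes$TotalMinutes <- rowSums(sample_minutes[, minute_cols])

# Flag records whose total exceeds the number of minutes in a day
over_a_day <- sample_minutes$TotalMinutes > 24 * 60
```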

# Number of users per dominant activity status
slices <- c(29, 4)
lbls <- c("Lightly Active", "Very Active")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(slices,labels = lbls, col=grey.colors(length(lbls)),
    main="Pie Chart of Activity")

The pie chart above shows the most frequent activity status among customers. According to this chart, 88% of customers are lightly active users and 12% are very active users.
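
The slice counts can also be derived from a vector of statuses rather than typed in by hand, which keeps the chart in step with the data. The sketch below rebuilds the same 29 and 4 counts from an illustrative vector; `largest_column` is an assumed name standing in for the status column computed during the analysis.

```r
# Illustrative status vector with the same counts as in the chart
largest_column <- rep(c("Lightly Active", "Very Active"), times = c(29, 4))

counts <- table(largest_column)           # users per status
pct <- round(100 * prop.table(counts))    # rounded percentage per status
lbls <- paste0(names(counts), " ", pct, "%")

# The chart could then be drawn with:
# pie(counts, labels = lbls, col = grey.colors(length(lbls)),
#     main = "Pie Chart of Activity")
```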

ggplot(data=steps_vs_calories2)+
  geom_point(mapping=aes(x=TotalSteps,y=Calories))+
  labs(title="Steps vs. Calories")

The scatter plot above shows a positive relationship between steps and calories. The more steps are made, the more calories are burned.
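
The strength of such a relationship can be summarized with a correlation coefficient and a fitted slope. The sketch below uses made-up values purely to show the calls; the results on the real data would need to be computed from steps_vs_calories2.

```r
# Made-up per-user totals for illustration only
sample_totals <- data.frame(
  TotalSteps = c(2000, 5000, 8000, 11000, 14000),
  Calories = c(1500, 1800, 2000, 2400, 2700)
)

# Pearson correlation between steps and calories
r <- cor(sample_totals$TotalSteps, sample_totals$Calories)

# Simple linear model: calories as a function of steps
fit <- lm(Calories ~ TotalSteps, data = sample_totals)
slope <- coef(fit)[["TotalSteps"]]
```

A positive `r` and a positive `slope` both indicate that calories tend to rise with steps; neither by itself shows that one causes the other.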

Business insights from analysis

Bellabeat is a successful company focused on producing smart products for women: Bellabeat app, Leaf, Time, Spring and Bellabeat membership.

The analysis of smart device fitness data from FitBit customers offers insight into how users of smart trackers use their devices.

Based on the trends identified, the following changes to the marketing strategy for the Bellabeat app and Time could be made to increase users’ activity and help unlock new growth opportunities for the company. Each item states a trend, followed by a recommendation:

  1. Most users took more than 7,500 steps. I recommend rewarding users for reaching 10,000 steps.
  2. The trackers in this dataset are mostly used by very active and lightly active users, with total activity of more than 800 minutes. I recommend focusing on sedentary users by sending short articles on the importance of daily activity in the Bellabeat app. In addition, I recommend sending push notifications on the Time product when activity seems to be low.
  3. Step count was found to be positively related to calories burned. I recommend setting a daily indicator for calories burned.